A Malware Detection Method
A PROJECT REPORT ON
BACHELOR OF TECHNOLOGY IN
Submitted By
Mrs. V. RAMA LAKSHMI
Assistant Professor
POTTI SRIRAMULU CHALAVADI MALLIKHARJUNARAO COLLEGE OF
CERTIFICATE
MR.S.SRIRAM (20KT1A05B8) Fulfillment for the award of the degree of Bachelor of Technology in
COMPUTER SCIENCE AND ENGINEERING of Jawaharlal Nehru Technological University, Kakinada
during the year 2020-2024. It is certified that all corrections/suggestions indicated for internal assessment
have been incorporated in the report. The project report has been approved as it satisfies the academic
requirements in respect of project work prescribed for the above degree
EXTERNAL EXAMINER
ACKNOWLEDGEMENTS
I owe a great many thanks to the many people who helped, supported, and advised
me at every step.
I am glad to have had the support of our principal, Dr. J. LAKSHMI NARAYANA, who
inspired me with his words of dedication and discipline towards work.
I express my gratitude to Dr. D. DURGA PRASAD, Professor & HoD of CSE, for
extending his support through training classes, which were a major resource in carrying out my
project.
Finally, I thank one and all who directly and indirectly helped me to complete my project
successfully.
Project Associate
S. SRIRAM (20KT1A05B8)
DECLARATION
Project Associate
S. SRIRAM
(20KT1A05B8)
ABSTRACT
Traditional signature-based malware detection approaches are
sensitive to small changes in the malware code. Currently, most malware
programs are adapted from existing programs. Hence, they share some
common patterns but have different signatures. To protect health sensor
data, it is necessary to identify the malware pattern rather than only
detect the small changes. To detect malware in health sensor data in a
timely manner, we propose a fast detection strategy that detects the
patterns in the code with machine learning-based approaches. In
particular, XGBoost, LightGBM and Random Forests will be exploited
to analyze the code from health sensor data. Terabytes of labeled
programs, including benign and malware programs, have been collected.
The challenges of this task are to select and extract the features, modify
the three models in order to train and test on the dataset, which consists
of health sensor data, and evaluate the features and models. When a malware
program is detected by one model, its pattern will be broadcast to the other
models, which will effectively prevent malware programs from intruding.
TABLE OF CONTENTS
1.INTRODUCTION 1
1.1 Motivation 1
1.2 Existing System 2
1.3 Objective 2
1.4 Outcomes 2
1.5 Applications 3
5.1 Flowchart 34
5.2 Code 34
6.TESTING 45
7.RESULTS AND DISCUSSIONS 53
8.CONCLUSION AND FUTURE SCOPE 59
9.REFERENCES 60
LIST OF FIGURES
S.NO NAME P.NO
1 Project SDLC 3
2 Use case diagram 31
3 Class diagram 32
4 Sequence diagram 33
1.INTRODUCTION
With the advent of the Internet of Things era, all kinds of sensors are used
to collect health sensor data. Inevitably, some malware or malicious code is concealed
in health sensor data; once it intrudes into the target host computer, it is
executed according to the logic prescribed by a hacker. The categories of malicious
code in health sensor data include computer viruses, worms, trojans, botnets,
ransomware and so on [1]. Malware attacks can steal core data and sensitive
information and damage computer systems and networks. Malware is one of the greatest
threats to today's computer security [2, 3].
Malware analysis is usually performed in one of two ways [4-7]. (1) Static analysis
is usually accomplished by examining the different resources of a binary file without
executing it and studying each component. Binary files can also be disassembled (or
reverse engineered) using a disassembler (such as IDA). Machine code can sometimes be
translated into assembly code, which humans can read and understand. Malware analysts
can read the assembly instructions and form a picture of what the program is meant to
do. Some modern malware is created using obfuscation techniques to defeat this
type of analysis, such as embedding syntactic code errors. These errors can confuse
the disassembler, but the code still works in actual execution. (2) Dynamic analysis is
performed by observing how the malware actually behaves when it runs on the host
system. Modern malware can employ a variety of evasion techniques that are designed
to defeat dynamic analysis, including testing for virtual environments or active
debuggers, delaying the execution of malicious payloads, or requiring some form of
interactive user input [8-10].
In this paper, we mainly focus on static code analysis. The early static code analysis
methods mainly include feature matching and broad-spectrum signature scanning. Feature
matching simply uses feature string matching to complete the detection, while
broad-spectrum scanning scans the feature code and uses masked bytes to separate the
sections that need to be compared from those that do not. Since both methods need to
obtain malware samples and extract features before the malware can be detected, they
lag seriously behind new threats (the hysteresis problem). Furthermore, with the
development of malware technology, malware has begun to mutate during transmission in
order to avoid being detected and removed, and there has been a sudden increase in the
number of malware variants. The variants differ so much in form that it is difficult
to extract a piece of code as a malware signature.
This work was supported by the Qatar National Research Fund (a member of the Qatar
Foundation) under Grant NPRP10-1205-160012. The statements made herein are
solely the responsibility of the authors.
1.1 Motivation
We mainly focus on static code analysis. The early static code analysis
methods mainly include feature matching and broad-spectrum signature scanning.
Feature matching simply uses feature string matching to complete the detection,
while broad-spectrum scanning scans the feature code and uses masked bytes
to separate the sections that need to be compared from those that do not.
Since both methods need to obtain malware samples and extract features
before the malware can be detected, they lag seriously behind new threats
(the hysteresis problem). Furthermore, with the development of malware
technology, malware has begun to mutate during transmission in order to avoid
being detected and removed, and there has been a sudden increase in the number
of malware variants. The variants differ so much in form that it is difficult
to extract a piece of code as a malware signature.
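The two signature-based approaches described above can be illustrated with a minimal Python sketch. The byte strings, the signature, and the function names `exact_match` and `masked_match` are invented here purely for illustration and are not taken from any real scanner or malware sample. The sketch shows why an exact feature string breaks when a single byte changes, while a masked signature still matches.

```python
import re

def exact_match(data: bytes, signature: bytes) -> bool:
    """Feature matching: the signature must appear byte for byte."""
    return signature in data

def masked_match(data: bytes, pattern: list) -> bool:
    """Broad-spectrum scan: None marks a masked byte that is not compared."""
    regex = b"".join(
        b"." if b is None else re.escape(bytes([b])) for b in pattern
    )
    # DOTALL lets the wildcard also match the newline byte 0x0A.
    return re.search(regex, data, flags=re.DOTALL) is not None

# Hypothetical byte strings, invented for this example only.
original = bytes.fromhex("90904889e5b801000000c3")
variant = bytes.fromhex("90904889e5b802000000c3")  # one byte changed

signature = bytes.fromhex("4889e5b801000000c3")
masked = [0x48, 0x89, 0xE5, 0xB8, None, 0x00, 0x00, 0x00, 0xC3]

print(exact_match(original, signature))  # True
print(exact_match(variant, signature))   # False: the exact signature is defeated
print(masked_match(variant, masked))     # True: the masked byte is ignored
```

Even the masked signature only tolerates changes at positions chosen in advance, which is the limitation that motivates learning patterns from data instead.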
1.2 Existing System
Based on this situation, a natural idea is to apply machine learning-
based methods that use existing experience and knowledge to perform static
code analysis on unknown binary code and automatically classify malware.
Following this idea, this paper uses machine learning-based techniques
and explores their application to the classification of malware.
1.2.1 Limitations of existing system
• Basic knowledge is required to perform static code analysis on unknown
binary code and automatically classify malware.
1.3 Objectives
The objective of the project is to develop a malware detection method for
health sensor data based on machine learning.
1.5 Applications
It can be used to detect malware.
1.6 STRUCTURE OF PROJECT (SYSTEM ANALYSIS)
2.Data Preprocessing
4.Modeling
5.Predicting
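The preprocessing, modeling, and predicting stages listed above could be wired together roughly as in the following sketch. It is an assumption-laden illustration, not the project's implementation: the feature (a normalized byte histogram), the nearest-centroid classifier used as a dependency-free stand-in for XGBoost, LightGBM, or Random Forests, the function names, and the toy samples are all hypothetical.

```python
from collections import Counter
from math import dist
from typing import Dict, List

def preprocess(raw: bytes) -> List[float]:
    """Data preprocessing: turn a binary into a normalized byte histogram."""
    counts = Counter(raw)
    total = max(len(raw), 1)
    return [counts.get(b, 0) / total for b in range(256)]

def train(samples: List[bytes], labels: List[str]) -> Dict[str, List[float]]:
    """Modeling: average the feature vectors of each class (nearest centroid).

    This simple model is only a stand-in for the tree ensembles named in
    the report; it keeps the sketch free of third-party dependencies.
    """
    sums: Dict[str, List[float]] = {}
    seen: Counter = Counter()
    for raw, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * 256)
        for i, value in enumerate(preprocess(raw)):
            acc[i] += value
        seen[label] += 1
    return {label: [v / seen[label] for v in acc] for label, acc in sums.items()}

def predict(model: Dict[str, List[float]], raw: bytes) -> str:
    """Predicting: return the label whose centroid is closest to the sample."""
    features = preprocess(raw)
    return min(model, key=lambda label: dist(model[label], features))

# Toy, invented training data: text-like bytes versus high-valued bytes.
samples = [
    b"hello world " * 20,
    b"config=value\n" * 20,
    bytes(range(128, 256)) * 2,
    bytes(range(100, 250)),
]
labels = ["benign", "benign", "malware", "malware"]

model = train(samples, labels)
print(predict(model, b"plain text file " * 10))
print(predict(model, bytes(range(150, 255))))
```

Keeping the three stages as separate functions means the stand-in model can be replaced by a real classifier without touching the preprocessing or prediction call sites.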
• Usability requirement
• Serviceability requirement
• Manageability requirement
• Recoverability requirement
• Security requirement
• Data Integrity requirement
• Capacity requirement
• Availability requirement
• Scalability requirement
• Interoperability requirement
• Reliability requirement
• Maintainability requirement
• Regulatory requirement
• Environmental requirement
2.LITERATURE SURVEY
Based on this situation, a natural idea is to apply machine learning-based
methods that use existing experience and knowledge to perform static code
analysis on unknown binary code and automatically classify malware.
Following this idea, this paper uses machine learning-based techniques
and explores their application to the classification of malware [11-14].
The essence of malware detection is a classification problem, which sorts
the samples to be detected into malware or legitimate software. Therefore,
host malware detection follows the core steps of a machine learning
workflow, and the main research steps of this paper are as follows:
(1) Collect sufficient malware code samples and legitimate software samples.
(2) Perform effective data processing on the samples and extract the features.
(3) Further select the main features for classification. (4) Train with
machine learning algorithms and establish a classification model. (5) Detect
unknown samples using the trained classification model. The ultimate goal is
to find the most effective features and models for this practical task. This
chapter introduces the main research questions and basic ideas. In the
following, we will introduce:
[1] S. Su, Y. Sun, X. Gao, J. Qiu* and Z. Tian*. A Correlation-change
based Feature Selection Method for IoT Equipment Anomaly
Detection. Applied Sciences.
In the era of the fourth industrial revolution, there is a growing trend to
deploy sensors on industrial equipment, and to analyze the industrial
equipment’s running status according to the sensor data. Thanks to the rapid
development of IoT technologies [1], sensor data can be easily fetched
from industrial equipment, and analyzed to produce further value for
industrial control at the edge of the network or at data centers. Due to the
considerable development of deep learning in recent years, a common
practice for such analysis is to apply deep learning [2,3,4]. Such methods
select a subset of all fetched sensor data streams as the input features, and
generate equipment predictions. As a result, the performance of the learning
model is seriously affected by the features selected, so feature selection
plays a critical role in such methods.
To select an appropriate set of features for the learning model,
researchers aim either to select the features most relevant to the prediction
model to improve the prediction performance, or to select the most informative
features to conduct data reduction. Unfortunately, both kinds of methods
have intrinsic drawbacks when applied in online scenarios. The former
kind of method depends heavily on predefined evaluation criteria, such
as feature relevance metrics [5] or a predefined learning model [6]. Thus,
such methods are limited to certain datasets, and are not suitable for online
scenarios, which involve dynamic and unsupervised feature selection. The
latter kind of method fits online scenarios well. However, data
reduction mainly aims to improve the efficiency (but not the accuracy) of the
prediction model, which is not the most important factor in online
industrial equipment status analysis.
To relieve the dependency on predefined evaluation criteria,
researchers have switched to selecting the features that can indicate the
online sensor data’s characteristics, such as features which are smoothest on
the graph [7], or the features with the highest clusterability [8,9]. In this
paper, we focus on the features with correlation changes such as smoothness
and clusterability, which are important characteristics for traditional
pattern recognition fields like image processing and voice recognition
[7,8,9]. We believe that correlation changes can effectively pinpoint status
changes in an industrial environment. As far as we know, this is the first
work focusing on correlation changes for online feature selection.
intentionally or accidentally. And that it is impossible to prevent from
spreading once the confidential information has leaked.
Although laws and regulations have been passed to punish various behaviors
of intentional data leakage, it is still hard to prevent data leakage effectively.
Confidential data can be easily disguised by rephrasing confidential contents
or embedding confidential contents in nonconfidential contents [5, 6]. In
order to avoid the problems arising from data leakage, lots of software and
hardware solutions have been developed which are discussed in the
following chapter.
[3] Y. Sun, M. Li, S. Su, Z. Tian, W. Shi, M. Han. Secure Data Sharing
Framework via Hierarchical Greedy Embedding in Darknets.
ACM/Springer Mobile Networks an
personalized for specific populations. In this paper, we propose a
recommendation method based on a dynamic charging area mechanism,
which recommends the appropriate initial charging area according to the
user’s warning level, and dynamically changes the charging area according
to the real-time state of EVs and charging piles. The recommendation
method based on a classification chain provides more personalized services
for users according to different charging needs and improves the utilization
ratio of charging piles. This satisfies users’ multilevel charging demands and
realizes more effective charging planning, which is beneficial to overall
balance. The chained recommendation method mainly consists of three
modules: intention detection, warning levels classification, and chained
recommendation. The dynamic charging area mechanism reduces the
occurrence of recommendation conflict and provides more personalized
service for users according to different charging needs. Simulations and
computations validate the correctness and effectiveness of the proposed
method.
Keywords: Electric vehicle · Recommendation conflict · Chained
recommendation · Dynamic charging area mechanism
Mathematics Subject Classification: 68W40
* Yu Jiang
T. Zhang et al.
1 Introduction
Recommendation for resources and services is a classic
problem. Especially in the current era of big data and cloud computing,
recommendation systems are widely used in a variety of fields.
Traditional recommendation algorithms are used in shopping, reading,
catering, accommodation, and other fields, which brings great convenience
to our daily lives. With the increasing popularity of new energy vehicles, the
demand for charging electric vehicles (EVs) is becoming increasingly
obvious. However, as a relatively new application field, the recommendation
of charging piles has not been addressed well enough by existing algorithms
to meet the increasing demand. Compared with the traditional gas
pump, the charging pile has the characteristics of longer charging time, less
stable price and limited service capacity. In addition, the charging pile and
the vehicle are required to match on some parameters. From the users’
point of view, they usually find it difficult to make choices or they are simply
too lazy to make such decisions. Thus, most of them tend to follow other
people’s decisions blindly and gather at the most popular charging piles,
which may lead to the unbalanced utilization of charging resources. All of
the above brings great challenges to the recommendation of
charging piles [1].
The EV industry is growing rapidly, and the government
is also vigorously promoting the construction of charging infrastructure. As
of January 2019, the ratio of public charging piles to new energy vehicles in
China is about 1:7.6. Owners of EVs can select idle charging piles or make
an appointment for charging through applications developed for
recommendation. However, due to the lack of a suitable dispatching method
in existing charging pile platforms, owners need to choose idle charging
piles independently [2]. On one hand, it leads to a poor user experience.
Users have to decide independently which charging pile to use or
reserve from a large number of charging piles that can meet their
conditions. On the other hand, it generates a lot of time fragmentation, which
could reduce the utilization rate of charging piles [3].
Recently, both industrial and academic communities have taken great interest
in EVs and charging pile deployment. The main issues studied are the siting of
charging piles and recommendation for EVs. For example, Tian et al. [4]
provided a real-time charging pile recommendation system for EV taxis via
large-scale GPS data mining. Jung et al. [5] used an activity-based model to
analyze the queue delay of charging piles and offer decision support for
choosing locations of undeployed charging piles. Besides, Gharbaoui et al.
[6] also used activity-based models and found that in urban areas, public
charging piles can be under-utilized and that the location of charging piles
should be chosen to reduce EV owners’ range anxiety [7, 8].
Mobility, high density, sparse connectivity, and heterogeneity bring spatial
challenges for the vehicular Internet of Things. For emerging vehicular IoT
applications, distributed communication, data caching, and computing tasks
are conducted to provide more reliable and efficient communications in
various network environments [9, 10]. Edge computing provides high-class
intelligent services and computing capabilities at the edge of the networks,
and constrained shortest distance (CSD) querying can also be applied to
recommendation algorithms [11, 12].
A method of chained recommendation for charging piles in internet…
Feng et al. [13] introduced a
distributed vehicular edge computing solution named the autonomous
vehicular edge (AVE), which can share neighboring vehicles’ available
resources via vehicle-to-vehicle (V2V) communications. Feng et al. [14]
designed an ant colony optimization algorithm to schedule buffers based on
information collected on an adjacent vehicle. While the rapid development
of IoT devices is changing our daily lives, some particular issues hinder the
massive deployment of IoT devices. Cloud-based malware detection can
utilize the data sharing and powerful computational resources of secured
servers to improve the detection performance. Such methods provide good
technical support for charging pile recommendation [15, 16].
The states of charging piles and EVs are changing all the time. The
recommendation method based on charging intention takes the charging needs
of surrounding users into account to make a more accurate recommendation
list for the served user. However, this method is not applicable to users with
urgent needs. To satisfy more users, it is vital to provide a personalized
recommendation list. Unlike books, movies, and other items,
charging piles and EVs have strong regional characteristics. Users who live or
work in a certain area are used to choosing charging piles nearby. The
recommendation method based on preference can provide users with
recommendation lists in line with their charging habits, but this method is
not applicable to newly installed charging piles and new users. The
recommendation of charging piles is a research direction of great value.
However, most of the existing research related to such recommendation
tends to have some shortcomings [17]. On one hand, the existing methods
consider inadequate reference factors, which leads to the uncertainty and
unavailability of the recommendation results, and affects the accuracy of
recommendation. On the other hand, the existing recommendation methods
are generalized for all users, rather than personalized for specific
populations. People nowadays prefer personalized services,
because those services can customize the most suitable solutions for users
with different needs [18].
In this paper, we propose a chained recommendation
method for charging piles, which can reduce the occurrence of
recommendation conflict. Based on a user’s location data and charging
history, we can determine whether the user has charging intentions or not. If
the user has charging intentions, he/she is marked as “a user to be charged”.
Based on the user’s profile, the endurance data of the EVs, and the charging
pile distribution in the area, we divide the “users to be charged” into three
warning levels, and screen the target charging areas for users based on their
warning levels. Once a user enters the target charging area, the
recommendation list of charging piles is generated and is sent to the user.
This method can satisfy users’ multilevel charging demands and improve
the utilization ratio of charging piles. It can realize a more effective allocation
of charging piles, which is beneficial to the overall balance of resource
utilization. The novelty of our method is mainly reflected in two aspects: On
the one hand, it applies a chained recommendation mechanism to make
recommendations for users with different charging needs. On the other hand,
a dynamic charging area mechanism is designed to detect recommendation
conflict and alleviate it to a certain extent, or even eliminate it entirely.
The organization of this paper is as follows. In Sect. 2, we describe the
construction of the recommendation model according to different conditions.
In particular, we introduce three sub-modules, namely the intention detection
module, the warning levels classification module, and the chained
recommendation module, in detail. In Sect. 3, we simulate and verify the
feasibility of our method. In the last Sect. 4, we conclude the paper and
suggest future work.
2 Recommendation modeling
The recommendation model consists of three modules: intention detection,
warning levels classification, and chained recommendation, as shown in
Figs. 1 and 2. The intention detection module
mainly includes two sub-modules, data acquisition and state probability
calculation, and is responsible for detecting the user’s charging intentions. The
warning levels classification module is based on the situation of EVs and
charging piles, the surrounding environment, users’ profile, and so on. It is
responsible for dividing the warning levels of users into three categories:
high-class, medium-class, and low-class warning levels. The warning levels
classification module aims to prepare for the subsequent chained
recommendation module.
[Fig. 1: Modules of the recommendation model (Warning Levels Classification
Module, Intention Detection Module, Dynamic Charging Area Mechanism, Chained
Recommendation Mechanism)]
[Fig. 2: Recommendation model]
The chained recommendation module is
mainly based on the dynamic charging area mechanism and the chained
recommendation mechanism. According to the different warning levels, it is
divided into three sub-modules. By referring to a variety of factors, it
provides users with different levels of recommendation services, and
generates the final lists of recommended charging piles [19–23]. At the same
time, according to the dynamic parameters of vehicles’ battery status,
driving information and charging demands, our system adjusts the chained
recommendation process and the recommendation strategy, and gives the
real-time recommendation results of charging piles, so as to optimize the
charging efficiency and utilization.
2.1 Intention detection module
Based on the user’s current GPS location information and charging history, the
intention detection module is responsible for detecting the
user’s current charging intention. There are three types of charging
intention: on the way to a charging pile (S1), has reached a charging pile (S2),
and has no charging intention (S3) [4].
• Data Acquisition: it is responsible
for acquiring the user’s charging history, as well as the user’s current GPS
location information.
• State Probability Calculation: it is responsible for statistical
analysis of data, and the specific approach is as follows. First, we divide one
day into several periods, and then count the number of days that a user is in
a certain state at a certain time based on the charging history. For example,
by analyzing the charging history of the user in the past month (30 days),
we can detect the charging intention of the user at T-time [24]. Suppose it is
found that in the past month, the number of days that the user is in state S1
at T-time is 5 days (N(S1) = 5), the number of days in state S2 is 5 days
(N(S2) = 5), and the number of days in state S3 is 20 days (N(S3) = 20). This
implies that the user has spent the most days in the state of S3 in the past
month, so it can be predicted that the user’s charging intention is likely to
be in the state of S3 at T-time, which means that the user has no charging
intention at T-time.
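The worked example above (5, 5, and 20 days out of 30) can be reproduced with a short Python sketch. It assumes the charging history has already been reduced to one state label per day for the time slot T; the function name `detect_intention` and the list-of-labels input format are hypothetical choices made for illustration, not part of the cited method.

```python
from collections import Counter

def detect_intention(history):
    """Estimate the charging intention at one time slot T.

    history: one state label ("S1", "S2" or "S3") per observed day,
    all taken at the same time of day. Only states that actually
    occur are counted, so every count N(Si) used here is positive.
    """
    counts = Counter(history)
    total = sum(counts.values())
    # Relative frequency of each state: N(Si) divided by the total count.
    probs = {state: n / total for state, n in counts.items()}
    # The predicted intention is the state with the highest probability.
    predicted = max(probs, key=probs.get)
    return predicted, probs

# The 30-day example from the text: 5 days in S1, 5 in S2, 20 in S3.
history = ["S1"] * 5 + ["S2"] * 5 + ["S3"] * 20
state, probs = detect_intention(history)
print(state)   # S3
print(probs)
```

With these counts the estimated probabilities are 1/6, 1/6, and 2/3, so the sketch predicts S3, matching the conclusion in the text.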
The probability calculation formula of intention detection is as follows:
$$P(S_i) = \frac{N(S_i)}{N(S_1) + N(S_2) + N(S_3)}, \quad N(S_i) > 0 \tag{1}$$
where $N(S_i)$ denotes the total number of days that the EV is in the state of
$S_i$ at T-time. In addition, charging intention is detected on the basis of
the following formula:
$$\mathrm{Maximum}\,[P(S_i)] \tag{2}$$
2.2 Warning levels classification module
2.2.1 User’s profile
It is necessary to judge whether the user is a VIP user or an ordinary user.
Under the pay-for-service model,
users can become VIP users by paying a certain fee, which enhances their
priority in recommendation services. For VIP users, it is considered that
they have the privilege of obtaining priority recommendation. Their warning
level is considered to be high-class, and we recommend high-quality and
convenient piles to them first [25]. For ordinary users, the warning level is
divided based on situations of vehicles, piles, and surrounding environment.
2.2.2 Surrounding environment
According to weather, air conditioning,
congestion idling, climbing, and other scenario information, we make
assumptions about the power consumption rate of batteries, and predict the
time left on the remaining power.
2.2.3 Situation of piles and EVs
It is necessary to classify warning levels based on the remaining power, ratio of
vehicles to piles, density, and other elements. It should be noted that the
less residual electricity a vehicle has, the more urgent the charging
demand is, and the higher the charging priority is. Besides, we pay more
attention to areas with lower distribution density of charging piles. In such
areas, strengthened recommendations and reminders are necessary.
2.3 Chained recommendation module
2.3.1 Dynamic charging area mechanism
When recommending charging piles, we may encounter situations where we
recommend the same charging pile to two or more users [26, 27]. Under
some circumstances, this leads to recommendation conflict if the number of
charging piles is insufficient. To mitigate recommendation conflict, the
dynamic charging area mechanism is applied. Under the dynamic charging
area mechanism, we recommend the dynamic charging area first and the
specific charging piles subsequently. First of all, an initial charging area is
recommended for users. The recommendation system continuously
refreshes the optimized recommendation list while the EV proceeds to the
recommended charging pile. As the constraints change continuously, the
recommendation list changes accordingly [28, 29]. That is to say, the
chained recommendation mechanism described in the following section
provides algorithmic support for the dynamic charging area mechanism.
When the user arrives at the recommended charging area, our system
recommends a charging pile located in the recommended charging area for
the user. Thus, even if the same charging pile is recommended to two or
more users and the number of charging piles is insufficient, our system can
re-recommend suitable charging piles that are in the same charging area and
close to users. Recommendation conflict detection is introduced into the
dynamic charging area mechanism to detect whether there is a
recommendation conflict at each time node or location node. It is responsible
for detecting whether a charging pile in the recommendation list generated for
users has been occupied in advance. If a recommendation conflict occurs,
the recommendation list is regenerated through the
chained recommendation mechanism [30–34]. The dynamic charging area
mechanism shown in Algorithm 1 adopts the idea of retrospective
recommendation. Even in the absence of recommendation conflict, it
continuously detects whether there is a more suitable charging pile based on the
change of the user’s location and time, and it updates the recommendation list
in real time. By adopting this mechanism, our recommendation method is
responsible for the recommendation results, and continuous tracking is
also available.
Algorithm 1 Dynamic Charging Area Mechanism
input: The initial set of charging areas
while an appropriate time node or location node is detected do
    if the EV arrives at the recommended charging area then
        Generate the recommendation list;
        break;
    else if updating is needed then
        Update the set of charging areas;
    end if
end while
return The recommendation list
2.3.2 Chained recommendation mechanism
It is responsible for making hierarchical recommendations and providing
integrated and fair services for users with different warning levels. By the
fusion of multiple models, it can improve the accuracy of recommendation.
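The control flow of Algorithm 1 (the dynamic charging area mechanism of Sect. 2.3.1) can be sketched in Python as follows. The source gives only the loop structure, so everything else here is an assumption made for illustration: the four callbacks (`arrived`, `needs_update`, `update_areas`, `make_list`), the representation of areas as a set of labels, and the toy node records are all hypothetical.

```python
def dynamic_charging_area(areas, nodes, arrived, needs_update,
                          update_areas, make_list):
    """Follow an EV through detected nodes until it reaches a recommended area.

    areas: the initial set of charging areas.
    nodes: the time or location nodes detected along the way, in order.
    The four callbacks are hypothetical hooks supplied by the caller.
    """
    recommendation = []
    for node in nodes:                      # "while a node is detected"
        if arrived(node, areas):
            recommendation = make_list(node, areas)
            break
        elif needs_update(node, areas):
            areas = update_areas(node, areas)
    return recommendation

# Toy scenario: area "A" becomes unavailable, then the EV reaches area "B".
nodes = [
    {"area": None, "blocked": "A"},
    {"area": None, "blocked": None},
    {"area": "B", "blocked": None},
]
result = dynamic_charging_area(
    areas={"A", "B"},
    nodes=nodes,
    arrived=lambda node, areas: node["area"] in areas,
    needs_update=lambda node, areas: node["blocked"] in areas,
    update_areas=lambda node, areas: areas - {node["blocked"]},
    make_list=lambda node, areas: [f"{node['area']}-pile-{i}" for i in (1, 2)],
)
print(result)  # ['B-pile-1', 'B-pile-2']
```

Passing the decisions in as callbacks keeps the loop identical to the pseudocode while leaving the conflict detection and list generation policies open.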
2.3.2.1 Level-one filtering
For users at high-class warning levels, their
charging needs are more urgent. We give these users the highest priority and
recommend charging piles for them in the most efficient way. In the level-one
filtering, the most convenient recommendation for users is based only on
some basic elements such as users’ current location information, EVs’
profile, the occupancy status of charging piles, and the parking space of
charging piles.
2.3.2.2 Level-two filtering
For users at medium-class warning levels, the level-two filtering is applied to
recommend charging piles for them. After the level-one filtering, an initial
set of charging piles is prepared. The charging piles in this set should meet
some requirements, such as appropriate distance (too close or too far is not
appropriate [35]), appropriate charging rate, and matching parameters
between charging piles and EVs. Afterwards, we introduce the improved
collaborative filtering algorithm in the level-two filtering. It integrates users’
preferences, waiting time, price, and other factors to make personalized,
socialized, and economical recommendation results for users.
• Calculating users’ preference for different charging piles: By analyzing users’
historical charging behavior and combining the current distance between users and
charging piles, we can calculate users’ preference for different charging piles,
and the calculation formula is as follows: where:
$p_i \in \langle p_1, p_2, \ldots, p_i \rangle$ represents the current position of the i-th EV;
$sta_j \in \langle sta_1, sta_2, \ldots, sta_j \rangle$ represents the j-th charging pile in the
initial set of charging piles;
$c_{ij} \in \langle c_{11}, \ldots, c_{1j}, c_{21}, \ldots, c_{ij} \rangle$ represents the total number
of times that the i-th EV has charged at the j-th charging pile;
$dist(p_i, sta_j)$ represents the real-time distance between the i-th EV and the j-th
charging pile at the current moment.
In addition, recommendation based on preference is
measured on the basis of the following formula:
• Calculating the waiting time when the user chooses different charging piles:
In general, the waiting time
is equal to the sum of the time spent on the way to the charging pile and
the time spent on charging. If other users arrive at the designated charging pile
first and all charging piles are occupied, the waiting time is equal
to the sum of the time spent on the way to the charging pile, the time spent on
charging, and the queuing time (the remaining charging time of the previous
user). The criterion for recommendation based on waiting time is to select
the charging pile with the shortest waiting time.
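The waiting-time criterion described above amounts to a small sum followed by a minimum, as in the following Python sketch. The dictionary keys (`name`, `travel`, `charging`, `queue`), the function names, and the numbers in minutes are hypothetical values chosen only to illustrate the rule.

```python
def waiting_time(travel, charging, queue=0.0):
    """Total waiting time: travel time plus charging time, plus queuing
    time when the pile is still occupied by a previous user."""
    return travel + charging + queue

def pick_pile(piles):
    """Select the charging pile with the shortest total waiting time."""
    return min(
        piles,
        key=lambda p: waiting_time(p["travel"], p["charging"], p.get("queue", 0.0)),
    )

# Invented candidates, all times in minutes.
piles = [
    {"name": "P1", "travel": 10, "charging": 40},               # total 50
    {"name": "P2", "travel": 5, "charging": 40, "queue": 15},   # total 60
    {"name": "P3", "travel": 20, "charging": 25},               # total 45
]
best = pick_pile(piles)
print(best["name"])  # P3
```

In this example the nearest pile (P2) is not the best choice, because the queuing time behind the previous user outweighs the shorter trip.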
a novel computing paradigm, called edge computing, has been proposed as
a promising solution [4]. A number of modest-size computing servers have
been deployed at the edge of pervasive radio access networks close to users,
so that users can offload their computing tasks to these servers with low
latency. Although the edge-based computation offloading approach can
significantly augment computation capability of users, developing a
comprehensive and reliable edge computing system remains challenging.
Edge servers have limited hardware resources. If too many users choose to
offload their tasks simultaneously, it would exceed the capacity of edge
servers, leading to long task response time. Therefore, it is critical to design
an efficient offloading strategy to decide which tasks of users should be offloaded
to edge servers. This problem has been recognized as one of the most critical
challenges for edge computing, but most existing work needs centralized
control to achieve global optimal performance [5, 6]. Unfortunately, it is not
practical to enforce all users to act according to a centralized control because
they are individuals with rational choices in computation offloading. Game
theory is a powerful framework to analyze the interactions among multiple
players who act in their own interests. It can be used to design decentralized
mechanisms, such that no player has the incentive to deviate unilaterally.
Thanks to its great promises, game theory has been applied for designing
offloading algorithm for edge computing by recent research efforts. For
example, Chen et al. [7, 8] have designed a decentralized computation
offloading game for mobile cloud computing. Jošilo et al. [9] have
proposed selfish decentralized computation offloading in dense wireless
networks where each user can offload its computation to multiple wireless
base stations. However, existing work can hardly be applied in practice
because of two weaknesses. First, they consider a discrete action model that
allows users to choose a limited number of actions. Although this model
works well in scenarios with a few users, it cannot handle large-scale
problems. A straightforward approach is to add more actions in the problem
formulation, but it leads to higher algorithm complexity. Second, existing
work has a strong assumption that all users should share their information,
e.g., quality of network connection and preference on energy efficiency, so
that they can make the best offloading decisions. However, users may be unwilling
to expose such personal information due to privacy and security concerns.
(Citation information: DOI 10.1109/TC.2020.2969148, IEEE Transactions on Computers.)
In this paper, we address the above weaknesses by designing an
algorithm based on game theory enhanced by deep reinforcement learning
(DRL). Specifically, we consider a number of users who can connect to an
edge server via multiple access points (e.g., base stations or WiFi routers).
Each user can arbitrarily divide its task into smaller subtasks and choose to
offload a portion of them to the edge server. A challenge arises because of
partial offloading. It makes the model more flexible, but users should choose
their actions from a continuous space, which is different from discrete
models used by existing work that considers simple offloading decisions,
e.g., offloading the whole task or not [10]. We first study a simple scenario
that users share their information, e.g., network bandwidth and preference,
and design an algorithm that is able to achieve Nash equilibrium. Based on
the insight provided by this algorithm, we then extend our work for
scenarios without information sharing. The problem is formulated as a
multi-agent partially observable Markov decision process (POMDP). To
address the challenges of network dynamics and continuous decision space,
we propose a decentralized approach based on deep reinforcement learning
(D-DRL) with policy gradient and differential neural computer (DNC). Our
approach can effectively learn the optimal offloading policy under high
network dynamics in a continuous decision space directly from computation
offloading game history without any prior knowledge about system models.
It has merits over model-based computation offloading game strategies in
that it is totally model-free and provides a general solution to computation
offloading problems. Thus, it can be applied to complex and unpredictable
situations where it is difficult to obtain the precise system models.
Moreover, the DNC, which is used here for the first time in policy gradient DRL,
is capable of remembering past information and inferring the hidden states of
observations automatically. Incorporating the DNC into our framework
not only accelerates the policy optimization process significantly
but also lets users learn a policy when the network is time-varying and
uncertain. The main contributions of this paper are summarized as follows:
• We study the task offloading problem in edge computing and formulate it
as a decentralized computation offloading game in each time slot by taking
into account both communication and computation cost. We solve this
problem by proposing an algorithm that can achieve Nash equilibrium.
• We study the offloading problem without information sharing and formulate it
as a multi-agent POMDP. An algorithm based on DRL and DNC has been
proposed to solve this challenging problem.
• Simulation results demonstrate the effectiveness of the proposed scheme by
comparing it with the state of the art.
The edge computing paradigm has attracted considerable
attention in both academia and industry over the past several years. Nokia
introduced the very first real-world edge computing platform in 2013 [11],
in which the computing platform called radio application cloud servers is
fully integrated with the Flexi Multiradio base stations. Saguna also
introduced their fully virtualized edge computing platform Open-RAN,
which can provide an open environment for running third-party edge
computing applications [12]. Currently, the industry specifications group
was formed to standardize the adoption of edge computing within the RAN
[4]. Much existing work has studied the computation offloading problem
from the perspective of a single user. Redenko et al. [13] have shown that
computation offloading can save energy according to their experimental
results. In [14], an optimization scheme for energy-efficient application
execution has been proposed on the cloud-assisted mobile application
platform. Xian et al. [15] have proposed an adaptive timeout scheme for
computation offloading to improve the energy saving. Huertacanepa et al.
[16] have proposed an adaptive application offloading scheme based on
current system conditions and the execution history of applications. Based
on Lyapunov optimization, the authors in [17] and [18] have studied the
dynamic computation offloading mechanism for minimizing computational
and communication energy consumption in a real network environment.
There are some works that have investigated the computation offloading
problem in the multi-user case. Rodrigues et al. [6] have proposed a hybrid
method for minimizing service delay in edge computing through virtual
machine migration and transmission power control. Yang et al. [19] have
proposed a genetic algorithm to solve the partition problem of wireless
network bandwidth among multiple users, which achieves high throughput
of processing the streaming data. In [20], Zhao et al. have proposed a low-
complexity heuristic method to implement energy-efficient task offloading
for multi-user mobile cloud computing. In [21], an iterative algorithm has
been proposed to perform the joint optimization of radio and computational
resources for multi-cell edge computing under the budget constraints of
latency and power. You et al. [22] have studied a centralized offloading
framework for a multi-user edge computing system based on TDMA and
OFDMA aiming to minimize the user’s
energy consumption. Guo et al. [23] have provided an energy-efficient
dynamic offloading and resource scheduling policy to reduce energy
consumption and shorten application completion time. The consensus protocol
in blockchain networks is a computation-intensive process, which may
prevent computationally lightweight nodes such as mobile devices
from directly participating in the consensus process. Xiong et al.
[24] have proposed a mining task offloading approach to alleviate this
limitation. However, all the above works need centralized control, ignoring the
interactions among multiple users when they independently determine their
computation offloading strategies. Some recent works [7, 8, 25–28] have
modeled users as self-interested game players and proposed decentralized
schemes to solve the multi-user computation offloading problems.
However, they mainly focus on the computation offloading problems in a
relatively static environment. In a real network environment, due to the
time-varying wireless networks, the utility of each user is dynamically changing,
and thus the solution of the Nash equilibrium in the static game model may
not be reached. In [29], the authors take into account the time-variant wireless
network, and model the computation offloading game as a stochastic game.
They assume that dedicated edge computing resources are allocated to each
user, so that users do not need to compete for computational resources at the
edge. However, this strong assumption is not practical in a real computational
environment, and would lead to low utilization of edge computation
resources. Xiao et al. [30] have studied the multi-user computation
offloading problem in time-variant wireless networks, where each user needs to
compete for the computational resource. A Q-learning based approach has been
proposed to achieve the Nash equilibrium of the dynamic computation
offloading game. However, the users’ decision space is discrete in their
model and the proposed approach has high complexity in solving large-scale
problems. It is challenging to achieve Nash equilibrium in the stochastic
games in the decentralized and dynamic environment. Multi-agent Nash Q-
Learning [31] has been proposed for discrete stochastic game. Lillicrap et
al. [32] have proposed the DDPG approach for the multi-agent Markov
decision process, where the environment is fully observable. In [33],
Srinivasan et al. have modeled the games as a partially observable Markov
decision process (POMDP), and examined the role of current policy gradient
and actor-critic algorithms. However, they focused on adversarial games. In
contrast to the previous research, our work in this paper formally addresses
the problem of partial computation offloading, dynamic environment and
incomplete information sharing in edge computing. This is a non-trivial
problem because each user can obtain only a partial observation and thus
cannot derive the optimal decision.
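The idea of a Nash equilibrium in a partial offloading game can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear congestion cost, the parameter names `base` and `congestion`, the sample local costs, and the discretized grid of offloading fractions (the surveyed work uses a continuous action space and a different system model). The sketch applies simple best-response dynamics with full information sharing, not the DRL-based approach.

```python
def cost(i, x, local_cost, base=1.0, congestion=0.5):
    """Cost of user i: local part plus offloaded part with a shared-load penalty."""
    load = sum(x)  # total fraction offloaded to the edge server
    return (1 - x[i]) * local_cost[i] + x[i] * (base + congestion * load)


def best_response_dynamics(local_cost, grid_steps=20, max_rounds=1000):
    """Let users take turns switching to their cheapest offloading fraction."""
    grid = [k / grid_steps for k in range(grid_steps + 1)]
    x = [0.0] * len(local_cost)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(x)):
            current = cost(i, x, local_cost)
            for candidate in grid:
                trial = x[:i] + [candidate] + x[i + 1:]
                c = cost(i, trial, local_cost)
                if c < current - 1e-12:  # accept strict improvements only
                    x[i], current, changed = candidate, c, True
        if not changed:  # nobody can improve alone: equilibrium on the grid
            return x
    raise RuntimeError("no equilibrium found within max_rounds")


def is_equilibrium(x, local_cost, grid_steps=20):
    """Check that no user can lower its cost by deviating unilaterally."""
    grid = [k / grid_steps for k in range(grid_steps + 1)]
    return all(
        cost(i, x, local_cost) <= cost(i, x[:i] + [g] + x[i + 1:], local_cost) + 1e-12
        for i in range(len(x))
        for g in grid
    )


local_cost = [3.0, 2.0, 1.2]  # hypothetical per-user cost of computing locally
profile = best_response_dynamics(local_cost)
print(profile, is_equilibrium(profile, local_cost))
```

Users with expensive local computation offload more, while the user with cheap local computation stays local once the server is congested. Without information sharing each user could not evaluate `cost` for the others' choices, which is why the surveyed work turns to a learning-based method.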
3. PROBLEM ANALYSIS
3.1 EXISTING APPROACH:
Based on this situation, a natural idea is to apply machine learning-based
methods that use existing experience and knowledge to perform static code
analysis on unknown binary code and automatically classify malware.
Following this idea, this paper uses machine learning-based methods and
explores their application to the classification of malware.
3.1.1 Drawbacks
Basic domain knowledge is needed to perform static code analysis on
unknown binary code and automatically classify malware.
3.2 Proposed System
In this paper, we mainly focus on static code analysis. The early static
code analysis methods mainly include feature matching or broad-
spectrum signature scanning. Feature matching simply uses feature
string matching to complete the detection, while the broad-spectrum
scanning scans the feature code and uses masked bytes to divide the
sections that need to be compared and those that do not need to be
compared. Since both methods need malware samples from which features are
extracted before detection is possible, they lag seriously behind new malware.
Furthermore, with the development of malware technology, malware
has begun to mutate during transmission in order to avoid being
detected and removed, and the number of malware variants has
increased sharply. The variants differ so much in form that it is
difficult to extract a piece of code as a malware signature.
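The difference between feature matching and broad-spectrum scanning described above can be sketched as follows. This is a minimal illustration, not a real scanner: the function names and the byte patterns are hypothetical, and `None` is used here as an assumed convention for a masked byte.

```python
def feature_match(data: bytes, feature: bytes) -> bool:
    """Feature matching: plain search for a fixed feature string."""
    return feature in data


def masked_match(data: bytes, signature: list) -> bool:
    """Broad-spectrum scan: None marks a masked byte that is not compared."""
    n = len(signature)
    for start in range(len(data) - n + 1):
        window = data[start:start + n]
        if all(s is None or s == b for s, b in zip(signature, window)):
            return True
    return False


# Hypothetical byte patterns, chosen only for illustration.
original = bytes([0x90, 0xE8, 0x10, 0x00, 0xC3])
variant = bytes([0x90, 0xE8, 0x2A, 0x00, 0xC3])  # one byte changed

feature = bytes([0xE8, 0x10, 0x00, 0xC3])
signature = [0xE8, None, 0x00, 0xC3]             # second byte is masked

print(feature_match(original, feature), feature_match(variant, feature))
print(masked_match(original, signature), masked_match(variant, signature))
```

The fixed feature string misses the variant after a single byte change, while the masked signature still matches it. Both still depend on a sample having been analyzed first, which is the lag discussed above.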
3.2.1 Advantages
• Feature matching completes the detection with simple feature string matching, while
broad-spectrum scanning uses masked bytes to separate the sections that must be
compared from those that need not be.
SOFTWARE REQUIREMENTS
The functional requirements or the overall description documents
include the product perspective and features, operating system and operating
environment, graphics requirements, design constraints and user
documentation.
Together, the requirements and implementation constraints
give a general overview of the project: where its strengths and
deficits lie and how to address them.
• Google Colab
HARDWARE REQUIREMENTS
Minimum hardware requirements are very dependent on the particular
software being developed by a given Enthought Python / Canopy / VS Code
user. Applications that need to store large arrays/objects in memory will
require more RAM, whereas applications that need to perform numerous
calculations or tasks more quickly will require a faster processor.
• Operating system : Windows, Linux
• Processor : minimum Intel i3
• RAM : minimum 4 GB
• Hard disk : minimum 250 GB
3.5 Algorithms
XGBoost, LightGBM and Random Forests
4. SYSTEM DESIGN
UML DIAGRAMS
The System Design Document describes the system requirements, operating
environment, system and subsystem architecture, files and database design,
input formats, output layouts, human-machine interfaces, detailed design,
processing logic, and external interfaces.
Global Use Case Diagrams:
Identification of actors:
Actor: Actor represents the role a user plays with respect to the system.
An actor interacts with, but has no control over the use cases.
Graphical representation:
<<Actor name>>
Actor
o Who affects the system? Or, which user groups are needed by the system
to perform its functions? These functions can be both main functions
and secondary functions such as administration.
o Which external hardware or systems (if any) use the system to perform
tasks?
o What problems does this application solve (that is, for whom)?
o And, finally, how do users use the system (use case)? What are they
doing with the system?
The actors identified in this system are:
a. System Administrator
b. Customer
c. Customer Care
Identification of usecases:
Usecase: A use case can be described as a specific way of using the
system from a user’s (actor’s) perspective.
Graphical representation:
• For each actor, find the tasks and functions that the actor should be able
to perform or that the system needs the actor to perform. The use case should
represent a course of events that leads to a clear goal.
• Name the use cases.
• Describe the use cases briefly by applying terms with which the user is
familiar. This makes the description less ambiguous.
Questions to identify use cases:
• What are the tasks of each actor?
• Will any actor create, store, change, remove or read information in the
system?
• What use case will store, change, remove or read this information?
• Will any actor need to inform the system about sudden external
changes?
• Does any actor need to be informed about certain occurrences in the system?
• What use cases will support and maintain the system?
Flow of Events
A flow of events is a sequence of transactions (or events) performed by the
system. They typically contain very detailed information, written in terms
of what the system should do, not how the system accomplishes the task.
Flows of events are created as separate files or documents in your favorite
text editor and then attached or linked to a use case using the Files tab of a
model element.
A flow of events should include:
• When and how the use case starts and ends
• Use case/actor interactions
• Data needed by the use case
• Normal sequence of events for the use case
• Alternate or exceptional flows
Construction of Usecase diagrams:
Use-case diagrams graphically depict system behavior (use cases). These
diagrams present a high level view of how the system is used as viewed from
an outsider’s (actor’s) perspective. A use-case diagram may depict all or
some of the use cases of a system.
A use-case diagram can contain:
• actors ("things" outside the system)
• use cases (system boundaries identifying what the system should do)
• Interactions or relationships between actors and use cases in the system
including the associations, dependencies, and generalizations.
Relationships in use cases:
1. Communication:
The communication relationship of an actor in a usecase is shown by
connecting the actor symbol to the usecase symbol with a solid path. The
actor is said to communicate with the usecase.
2. Uses:
A Uses relationship between the usecases is shown by a generalization
arrow from the usecase.
3. Extends:
The extend relationship is used when we have one usecase that is similar to
another usecase but does a bit more. In essence, it is like a subclass.
SEQUENCE DIAGRAMS
A sequence diagram is a graphical view of a scenario that shows object
interaction in a time-based sequence: what happens first, what happens
next. Sequence diagrams establish the roles of objects and help provide
essential information to determine class responsibilities and interfaces.
There are two main differences between sequence and collaboration
diagrams: sequence diagrams show time-based object interaction while
collaboration diagrams show how objects associate with each other. A
sequence diagram has two dimensions: typically, vertical placement
represents time and horizontal placement represents different objects.
Object:
An object has state, behavior, and identity. The structure and behavior of
similar objects are defined in their common class. Each object in a diagram
indicates some instance of a class. An object that is not named is referred to
as a class instance.
The object icon is similar to a class icon except that the name is
underlined: An object's concurrency is defined by the concurrency of its
class.
Message:
A message is the communication carried between two objects that triggers
an event. A message carries information from the source focus of control
to the destination focus of control. The synchronization of a
message can be modified through the message specification. A synchronous
message is one where the sending object pauses to wait for results.
Link:
A link should exist between two objects, including class utilities, only if
there is a relationship between their corresponding classes. The existence
of a relationship between two classes symbolizes a path of communication
between instances of the classes: one object may send messages to another.
The link is depicted as a straight line between objects or objects and class
instances in a collaboration diagram. If an object links to itself, use the
loop version of the icon.
CLASS DIAGRAM:
Identification of analysis classes:
A class is a set of objects that share a common structure and common
behavior (the same attributes, operations, relationships and semantics). A
class is an abstraction of real-world items. There are 4 approaches for
identifying classes:
a. Noun phrase approach.
b. Common class pattern approach.
c. Use case driven sequence or collaboration approach.
d. Classes, Responsibilities and Collaborators (CRC) approach.
1. Noun Phrase Approach:
The guidelines for identifying the classes:
• Look for nouns and noun phrases in the usecases.
• Some classes are implicit or taken from general knowledge.
• All classes must make sense in the application domain; Avoid
computer implementation classes – defer them to the design stage.
• Carefully choose and define the class names.
After identifying the classes, we have to eliminate the following types of classes:
• Adjective classes.
2. Common class pattern approach:
The following are the patterns for finding the candidate classes:
• Concept class.
• Events class.
• Organization class
• People class
• Places class
• Tangible things and devices class.
3. Use case driven approach:
We have to draw the sequence diagram or collaboration diagram. If there is
need for some classes to represent some functionality then add new classes
which perform those functionalities.
4. CRC approach:
The process consists of the following steps:
• Identify classes’ responsibilities (and identify the classes).
• Assign the responsibilities.
• Identify the collaborators.
Identification of responsibilities of each class:
The questions that should be answered to identify the attributes and methods
of a class respectively are:
a. What information about an object should we keep track of?
b. What services must a class provide?
Identification of relationships among the classes:
Three types of relationships among the objects are:
Association: How are objects associated?
Super-sub structure: How are objects organized into super classes and sub
classes?
Aggregation: What is the composition of the complex classes?
Association:
The questions that will help us to identify the associations are:
a. Is the class capable of fulfilling the required task by itself?
b. If not, what does it need?
c. From what other classes can it acquire what it needs?
Guidelines for identifying the tentative associations:
• A dependency between two or more classes may be an association.
Association often corresponds to a verb or prepositional phrase.
are transitivity and anti symmetry.
The questions whose answers will determine the distinction between the
part and whole relationships are:
• Does the part class belong to the problem domain?
• Is the part class within the system’s responsibilities?
• Does the part class capture more than a single value? (If not, then
simply include it as an attribute of the whole class.)
• Does it provide a useful abstraction in dealing with the problem
domain?
There are three types of aggregation relationships. They are:
Assembly:
It is constructed from its parts and an assembly-part situation physically
exists.
Container:
A physical whole encompasses but is not constructed from physical parts.
Collection member:
A conceptual whole encompasses parts that may be physical or conceptual.
The container and collection are represented by hollow diamonds, but
composition is represented by a solid diamond.
USE CASE DIAGRAM
A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a Use-case analysis.
Its purpose is to present a graphical overview of the functionality provided
by a system in terms of actors, their goals (represented as use cases), and
any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor.
Roles of the actors in the system can be depicted.
[Use case diagram: Start, Data Processing, Run Algorithm, Accuracy Graph, Exit]
CLASS DIAGRAM
In software engineering, a class diagram in the Unified
Modeling Language (UML) is a type of static structure diagram that
describes the structure of a system by showing the system's classes, their
attributes, operations (or methods), and the relationships among the classes.
It explains which class contains information.
SEQUENCE DIAGRAM
[Sequence diagram between User and System: Data Processing, Run Algorithm, Accuracy Graph]
5. IMPLEMENTATION
1.3 Architecture
5.4 Code
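# Imports inferred from the names used below (the listing omits them).
import time
import tkinter
from tkinter import END, Button, Label, filedialog

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from genetic_selection import GeneticSelectionCV
from keras.layers import Dense
from keras.models import Sequential
from sklearn import svm
from sklearn.model_selection import train_test_split

main = tkinter.Tk()
main.title("Android Malware Detection")
main.geometry("1300x1200")

# Module-level state shared by the button callbacks below:
# filename, X_train, X_test, y_train, y_test, svmga_classifier,
# svm_acc, nn_acc, svmga_acc, annga_acc,
# svm_time, svmga_time, nn_time, nnga_time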
def upload():
    """Ask the user for a dataset file and show its path in the UI."""
    global filename
    filename = filedialog.askopenfilename(initialdir="dataset")
    pathlabel.config(text=filename)
    text.delete('1.0', END)
    text.insert(END, filename + " loaded\n")
def generateModel():
    """Load the CSV and split it 80/20 into training and test sets."""
    global X_train, X_test, y_train, y_test
    text.delete('1.0', END)
    train = pd.read_csv(filename)
    rows = train.shape[0]  # number of rows
    cols = train.shape[1]  # number of columns
    features = cols - 1    # the last column holds the class label
    print(features)
    X = train.values[:, 0:features]
    Y = train.values[:, features]
    print(Y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=0)
    text.insert(END, "Splitted Training Length : " + str(len(X_train)) + "\n")
    text.insert(END, "Splitted Test Length : " + str(len(X_test)) + "\n\n")
def runSVM():
    """Train an RBF-kernel SVM and record its accuracy and run time."""
    global svm_acc
    global svm_time
    start_time = time.time()
    text.delete('1.0', END)
    cls = svm.SVC(C=2.0, gamma='scale', kernel='rbf', random_state=2)
    cls.fit(X_train, y_train)
    # prediction() and cal_accuracy() are helpers not shown in this listing.
    prediction_data = prediction(X_test, cls)
    svm_acc = cal_accuracy(y_test, prediction_data, 'SVM Accuracy')
    svm_time = time.time() - start_time
def runSVMGenetic():
    """Train the SVM on features chosen by genetic-algorithm selection."""
    global svmga_acc
    global svmga_classifier
    global svmga_time
    text.delete('1.0', END)
    estimator = svm.SVC(C=2.0, gamma='scale', kernel='rbf', random_state=2)
    svmga_classifier = GeneticSelectionCV(estimator,
                                          cv=5,
                                          verbose=1,
                                          scoring="accuracy",
                                          max_features=5,
                                          n_population=50,
                                          crossover_proba=0.5,
                                          mutation_proba=0.2,
                                          n_generations=40,
                                          crossover_independent_proba=0.5,
                                          mutation_independent_proba=0.05,
                                          tournament_size=3,
                                          n_gen_no_change=10,
                                          caching=True,
                                          n_jobs=-1)
    start_time = time.time()
    svmga_classifier = svmga_classifier.fit(X_train, y_train)
    # Measure the elapsed time directly instead of deriving it from svm_time.
    svmga_time = time.time() - start_time
    prediction_data = prediction(X_test, svmga_classifier)
    svmga_acc = cal_accuracy(
        y_test, prediction_data,
        'SVM with GA Algorithm Accuracy, Classification Report & Confusion Matrix')
def runNN():
    """Train a small feed-forward network and record accuracy and run time."""
    global nn_acc
    global nn_time
    text.delete('1.0', END)
    start_time = time.time()
    model = Sequential()
    # input_dim=215 assumes the dataset has 215 feature columns.
    model.add(Dense(4, input_dim=215, activation='relu'))
    model.add(Dense(215, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=50, batch_size=64)
    _, ann_acc = model.evaluate(X_test, y_test)
    nn_acc = ann_acc * 100
    text.insert(END, "ANN Accuracy : " + str(nn_acc) + "\n\n")
    nn_time = time.time() - start_time
def runNNGenetic():
    """Train the network on a reduced set of 100 feature columns."""
    global annga_acc
    global nnga_time
    text.delete('1.0', END)
    train = pd.read_csv(filename)
    rows = train.shape[0]  # number of rows
    cols = train.shape[1]  # number of columns
    features = cols - 1    # the last column holds the class label
    print(features)
    X = train.values[:, 0:100]  # keep only the first 100 feature columns
    Y = train.values[:, features]
    print(Y)
    X_train1, X_test1, y_train1, y_test1 = train_test_split(
        X, Y, test_size=0.2, random_state=0)
    model = Sequential()
    model.add(Dense(4, input_dim=100, activation='relu'))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    start_time = time.time()
    model.fit(X_train1, y_train1)
    nnga_time = time.time() - start_time
    _, ann_acc = model.evaluate(X_test1, y_test1)
    annga_acc = ann_acc * 100
    text.insert(END, "ANN with Genetic Algorithm Accuracy : " + str(annga_acc) + "\n\n")
def graph():
    """Bar chart comparing the accuracy of the four models."""
    height = [svm_acc, nn_acc, svmga_acc, annga_acc]
    bars = ('SVM Accuracy', 'NN Accuracy', 'SVM Genetic Acc', 'NN Genetic Acc')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()


def timeGraph():
    """Bar chart comparing the run time of the four models."""
    height = [svm_time, svmga_time, nn_time, nnga_time]
    bars = ('SVM Time', 'SVM Genetic Time', 'NN Time', 'NN Genetic Time')
    y_pos = np.arange(len(bars))
    plt.bar(y_pos, height)
    plt.xticks(y_pos, bars)
    plt.show()
uploadButton = Button(main, text="Upload Android Malware Dataset", command=upload)
uploadButton.place(x=50, y=100)
uploadButton.config(font=font1)
pathlabel = Label(main)
pathlabel.config(bg='brown', fg='white')
pathlabel.config(font=font1)
pathlabel.place(x=460, y=100)
nngaButton = Button(main, text="Run Neural Network with Genetic Algorithm", command=runNNGenetic)
nngaButton.place(x=50, y=200)
nngaButton.config(font=font1)
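The listing above calls prediction() and cal_accuracy() without showing them. A minimal sketch of what such helpers could look like is given here. It is hypothetical: the bodies, the printed report format, and the stand-in `ThresholdClassifier` are assumptions for illustration and are not the project's actual code, which probably also writes to the text widget.

```python
def prediction(X_test, classifier):
    """Return the classifier's predicted labels for the test samples."""
    return list(classifier.predict(X_test))


def cal_accuracy(y_test, y_pred, details):
    """Print a small report and return accuracy as a percentage."""
    pairs = list(zip(y_test, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    print(details, ":", accuracy)
    print("Confusion matrix [[TN, FP], [FN, TP]]:", [[tn, fp], [fn, tp]])
    return accuracy


class ThresholdClassifier:
    """Stand-in model for the demo: flags a sample when its feature sum exceeds 2."""

    def predict(self, X):
        return [1 if sum(row) > 2 else 0 for row in X]


# Hypothetical demo data with binary labels (1 = malware, 0 = benign).
X_demo = [[0, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]]
y_demo = [0, 1, 0, 1]
y_hat = prediction(X_demo, ThresholdClassifier())
acc = cal_accuracy(y_demo, y_hat, "Demo Accuracy")
```

Any object with a predict() method, such as the fitted SVC or GeneticSelectionCV instance above, could be passed in place of the stand-in classifier.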
6. TESTING
6.1 SOFTWARE TESTING
Testing
structure. Tests are based on requirements and functionality.
Unit Testing
Integration Testing
Beta Testing
Performance Testing
Fig. 4.2: Black Box Testing
When applied to machine learning models, black box testing means
testing the models without knowing internal details such
as the features of the model or the algorithm used to create it.
The challenge, however,
is to verify the test outcome against expected values that are known
beforehand.
Fig. 4.2 above represents the black box testing procedure for machine
learning algorithms.
Table 4.1: Black box Testing
[16,6,324,0,0,0,22,0,0,0,0,0,0] 0 0
[16,7,263,7,0,2,700,9,10,1153,832,9,2] 1 1
The model gives the correct output for the different inputs
listed in Table 4.1. Therefore, the program is considered to
behave as expected.
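The black box check against Table 4.1 can be sketched as a small test harness. This is only an illustration: `stand_in_model` is a hypothetical placeholder rule, not the trained classifier, and reading the two trailing columns of Table 4.1 as expected and actual output is an assumption because the table header is missing.

```python
def run_black_box_tests(predict, cases):
    """Feed each input to the opaque predict() and compare with the expected label."""
    results = []
    for features, expected in cases:
        actual = predict(features)
        results.append((features, expected, actual, actual == expected))
    return results


def stand_in_model(features):
    """Hypothetical placeholder for the trained model (not the report's classifier)."""
    return 1 if sum(features[3:]) > 100 else 0


# Inputs and expected labels taken from Table 4.1.
cases = [
    ([16, 6, 324, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0], 0),
    ([16, 7, 263, 7, 0, 2, 700, 9, 10, 1153, 832, 9, 2], 1),
]
results = run_black_box_tests(stand_in_model, cases)
for features, expected, actual, passed in results:
    print(expected, actual, "PASS" if passed else "FAIL")
```

The harness sees only inputs and outputs, so the same cases can be reused unchanged when the model behind predict() is replaced.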
Testing
Testing is the process of executing a program with the aim of finding errors. For our
software to perform well, it should be as free of errors as possible. Successful testing
uncovers errors so that they can be removed from the software.
3. Unit testing
4. Integration Testing
5. Alpha Testing
6. Beta Testing
7. Performance Testing and so on
Unit Testing
Integration Testing
The phase in software testing in which individual software modules are combined
and tested as a group. It is usually conducted by testing teams.
Alpha Testing
Beta Testing
Final testing before releasing application for commercial purpose. It is typically done
by end- users or others.
Performance Testing
engineer.
Test Case Id | Test Case Name | Test Case Description | Test Step | Expected | Actual | Test Case Status | Test Priority
03 | User Mode | Verify the working of the application in freestyle mode | If it doesn’t respond | We cannot use the Freestyle mode | The application displays the Freestyle Page | High | High
04 | Data Input | Verify if the application takes input and updates the Database | If it fails to take the input or store it | We cannot proceed further | The application updates the input to application | High | High
7. RESULTS AND DISCUSSIONS
Upload the data. The application reads it and shows basic information about the data on the screen.
4. Now click on “Train and Test model”. The data is split into a training set and a test set: the
training set is used to train the models, and the test set is used to measure their performance.
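The split performed in this step can be sketched as a simple holdout split. This is a minimal sketch under assumptions: the function name train_test_split_rows, the 20% test fraction and the fixed seed are illustrative choices and are not taken from the project's code.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Toy rows of (feature vector, label), invented for illustration only.
    rows = [([i, i + 1], i % 2) for i in range(10)]
    train, test = train_test_split_rows(rows)
    print(len(train), "training rows,", len(test), "test rows")
```

Keeping the test rows out of training is what lets the accuracy measured on them estimate how the model behaves on unseen programs.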
[Figure: application screen with the buttons “Train And Test Model”, “Run Algorithm” and “Accuracy Graph”]
5. Now click on “Run Algorithms”. The algorithms mentioned above are run on the data.
The Naive Bayes algorithm, which was added as the extension, performed well compared to the other algorithms.
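A comparison of this kind rests on computing the accuracy of each algorithm on the same test labels and selecting the highest score. The following is a minimal sketch: the function names are hypothetical, and the labels and predictions are toy values invented for illustration, not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def best_model(y_true, predictions_by_model):
    """Return (name, accuracy) of the highest-scoring model."""
    scores = {
        name: accuracy(y_true, pred)
        for name, pred in predictions_by_model.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Toy labels and predictions, invented purely for illustration.
    y_true = [0, 1, 1, 0, 1]
    predictions = {
        "model_a": [0, 1, 0, 0, 1],  # 4 of 5 correct
        "model_b": [1, 1, 0, 0, 0],  # 2 of 5 correct
    }
    print(best_model(y_true, predictions))
```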
8. CONCLUSION
With the increasing complexity of malware code concealed in health sensor data [27-30, 38, 40],
the application of machine learning algorithms to the detection of malicious code has been
increasingly valued by the academic community and numerous security vendors. Based on the
theory of machine learning, this paper combines the advantages of different models [31-33, 36-37]
and discusses static code analysis based on different machine learning algorithms and
different code features. This work can provide referential value for the future design and
implementation of machine-learning-based malware detection technology [34]. However, this area
is still at a developmental stage. Many future tasks and challenges remain, and they are
summarized below.
FUTURE WORK
1. Lack of valuable data: A machine learning algorithm often requires tens of thousands of
samples [35] to be trained into an effective model. Acquiring this basic data often requires
manual work, so the speed of collection cannot be guaranteed [36, 37].
2. Lack of interpretable results: The underlying reason is that, for many features, we only know
that they are effective and do not know why. Explaining this will be the most important
challenge for the future.
8. BIBLIOGRAPHY
[1] L. Wu, X. Du, W. Wang, B. Lin, “An Out-of-band Authentication Scheme for Internet of
Things Using Blockchain Technology,” in Proc. of IEEE ICNC 2018, Maui, Hawaii, USA, March
2018.
[2] M. Shen, B. Ma, L. Zhu, R. Mijumbi, X. Du, and J. Hu, “Cloud-Based Approximate
Constrained Shortest Distance Queries over Encrypted Graphs with Privacy Protection,” IEEE
Transactions on Information Forensics & Security, Volume: 13, Issue: 4, Page(s): 940 – 953, April
2018, DOI: 10.1109/TIFS.2017.2774451.
[3] P. Dong, X. Du, H. Zhang, and T. Xu, “A Detection Method for a Novel DDoS Attack against
SDN Controllers by Vast New Low-Traffic Flows,” in Proc. of the IEEE ICC 2016, Kuala Lumpur,
Malaysia, 2016.
[4] Z. Tian, Y. Cui, L. An, S. Su, X. Yin, L. Yin and X. Cui, “A Real-Time Correlation of
Host-Level Events in Cyber Range Service for Smart Campus,” IEEE Access, vol. 6,
pp. 35355-35364, 2018. DOI: 10.1109/ACCESS.2018.2846590.
[5] Q. Tan, Y. Gao, J. Shi, X. Wang, B. Fang, and Z. Tian, “Towards a Comprehensive Insight into
the Eclipse Attacks of Tor Hidden Services,” IEEE Internet of Things Journal, 2018. DOI:
10.1109/JIOT.2018.2846624.
[6] Z. Wang, C. Liu, J. Qiu, Z. Tian, C., Y. Dong, S. Su, “Automatically Traceback RDP-based
Targeted Ransomware Attacks,” Wireless Communications and Mobile Computing, 2018.
https://doi.org/10.1155/2018/7943586.
[7] L. Xiao, Y. Li, X. Huang, X. Du, “Cloud-based Malware Detection Game for Mobile Devices
with Offloading,” IEEE Transactions on Mobile Computing, Volume: 16, Issue: 10, Pages: 2742
– 2750, Oct. 2017. DOI: 10.1109/TMC.2017.2687918.
[8] https://en.wikipedia.org/wiki/Malware_analysis
[9] Z. Tian, W. Shi, Y. Wang, C. Zhu, X. Du, et al., “Real-Time Lateral Movement Detection
Based on Evidence Reasoning Network for Edge Computing Environment,” IEEE Transactions
on Industrial Informatics, Volume: 15, Issue: 7, Page(s): 4285 – 4294, March 2019.
[10] L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, M. Guizani, “Security in mobile edge caching with
reinforcement learning,” IEEE Wireless Communications, Volume: 25, Issue: 3, pp. 116-122, June
2018, DOI: 10.1109/MWC.2018.1700291.