Malware - Detection - Using - Machine - Learning (3) - Removed

The document is a seminar report on malware detection using machine learning. It provides an introduction to malware, including different types and methods of analysis. It also discusses traditional approaches to malware detection like signature-based, heuristic-based, and behavioral-based methods. The report introduces machine learning concepts and algorithms relevant to malware detection, like supervised/unsupervised learning and support vector machines. It presents case studies on applying machine learning for malware detection in antivirus software and on mobile devices. Finally, it discusses limitations of machine learning for malware detection and potential ways to overcome them.

A

Seminar Report
On
Malware Detection Using Machine
Learning
Submitted to the Department of Electronics Engineering in Partial Fulfilment for the
Requirements for the Degree of

Bachelor of Technology
(Electronics and Communication)

by

MANISH KUMAR SINGH


(U21EC123)
(B. TECH. III(EC), 5th Semester)

Guided by

Dr J. N. PATEL
Associate Professor, DECE

DEPARTMENT OF ELECTRONICS ENGINEERING


SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY
NOV-2023
Sardar Vallabhbhai National Institute Of Technology
Surat - 395 007, Gujarat, India

DEPARTMENT OF ELECTRONICS ENGINEERING

CERTIFICATE
This is to certify that the SEMINAR REPORT entitled “Malware Detection Using Machine
Learning” is presented & submitted by Candidate MANISH KUMAR SINGH, bearing Roll No.
U21EC123, of B.Tech. III, 5th Semester in the partial fulfillment of the requirement for
the award of B.Tech. Degree in Electronics & Communication Engineering for academic year
2023 - 24.
He has successfully and satisfactorily completed his Seminar Exam in all respects. We
certify that the work is comprehensive, complete and fit for evaluation.

Name of Examiners Signature with Date

1.
2.

Dr J. N. SARVAIYA
Head & Professor Seal of The Department
DECE, SVNIT (November 2023)

Dr. J. N. PATEL
Associate Professor & Seminar Guide
Acknowledgements
I would like to express my profound gratitude and deep regards to my guide Dr. J. N.
Patel for his guidance. I am heartily thankful for his suggestions and for the clarity
he brought to the concepts of the topic, which helped me a lot in this work.
I would also like to thank Dr. J. N. SARVAIYA, Head of the Electronics Engineering
Department, SVNIT and all the faculty members of ECED for their co-operation and
suggestions.
Finally, I would like to thank my classmates for their love, support and encouragement
during this challenging period. They have always been there for me and helped me
overcome any difficulties.

Manish Kumar Singh


Sardar Vallabhbhai National Institute of Technology
Surat

Nov 27 , 2023

Abstract
This report delves into the critical domain of malware detection by exploring the in-
tegration of machine learning techniques. The study encompasses various aspects of
malware, their classification, and the evolution of traditional detection methods, as well
as an in-depth examination of machine learning approaches. The objective is to pro-
vide an extensive overview of the challenges, benefits, and real-world applications of
machine learning in the realm of malware detection.

In the initial chapters, we elucidate the diverse landscape of malware, including
static, dynamic, and memory-damaging malware analysis, while also discussing the various
types of malware, such as viruses, trojans, and spyware. Understanding these aspects is
crucial for building a strong foundation in the pursuit of effective malware detection.

Moving forward, the report examines the traditional approaches to malware de-
tection. It highlights signature-based, heuristic-based, and behavioral-based detection
methods, elucidating their strengths and limitations. By gaining insights into these
conventional methods, we set the stage for a comprehensive exploration of machine
learning.

Chapter 3 focuses on machine learning, providing an introduction to its core concepts.
Supervised and unsupervised learning, Support Vector Machines (SVM), and Ensemble
Methods are among the topics covered in this section. This knowledge equips readers
with the necessary tools to understand how machine learning can be leveraged in the
context of malware detection.

Chapter 4 delves into case studies and real-world applications that showcase the
practicality and efficacy of applying machine learning to malware detection. These
real-world examples offer a glimpse into the tangible impact of this technology in the
field of cybersecurity.

The report concludes by summarizing the key takeaways from each chapter and
emphasizing the importance of harnessing the power of machine learning to combat
the evolving threat landscape of malware. This report serves as a valuable resource for
security professionals, researchers, and anyone interested in understanding the dynamic
landscape of malware detection and the pivotal role played by machine learning in this
field.

Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapters
1 INTRODUCTION
1.1 Scope and Significance
1.1.1 Scope:
1.1.2 Significance:
1.2 Definition and Classification of Malware
1.2.1 Static Malware Analysis
1.2.2 Dynamic Malware Analysis
1.2.3 Memory-damaging Malware Analysis
1.3 Common Types of Malware
2 Traditional Approaches to Malware Detection
2.1 Signature-based Detection
2.1.1 Benefits of signature-based detection:
2.1.2 Challenges of signature-based detection:
2.2 Heuristic-based Detection
2.2.1 Techniques used in heuristic-based detection:
2.2.2 Benefits of heuristic-based detection:
2.2.3 Challenges of heuristic-based detection:
2.3 Behavioral Analysis
2.3.1 Some of the most common techniques for behavioral analysis
2.3.2 Benefits of Behavioral-based detection:
2.3.3 Challenges of Behavioral-based detection:
3 Introduction to Machine Learning
3.1 Basics of Machine Learning Algorithms
3.2 Supervised, Unsupervised, and Reinforcement Learning
3.3 Support Vector Machines (SVM)
3.3.1 Linear SVM
3.3.2 Non-Linear SVM
3.4 Ensemble Methods
3.4.1 Types of Ensemble Methods
3.4.2 Ensemble methods can be used to improve the performance of machine learning
4 Case Studies and Real-world Applications
4.1 Malware Detection in Antivirus Software
4.2 Mobile Malware Detection
4.2.1 Benefits of using machine learning for malware detection
4.3 Limitations of Machine Learning in Malware Detection
4.3.1 Things that can be done to reduce the limitations of machine learning
5 Summary and Future scope
References

List of Figures
1.1 Types of Malware
2.1 Work flow diagram of the proposed system
3.1 Support Vector Machines (SVM) [1]
3.2 Proposed ensemble method for classification of both imbalanced and balanced malaria
disease datasets [2]
4.1 Proposed ML malware detection method [3]

List of Abbreviations
ML Machine Learning
AI Artificial Intelligence
API Application Programming Interface
DLL Dynamic Link Library
IDA Interactive Disassembler
CAT Computer-Aided Translation
APK Android Application Package
SVM Support Vector Machine
2D Two-Dimensional
XAI Explainable Artificial Intelligence

Chapter 1
INTRODUCTION
MALWARE is defined as software designed to infiltrate or damage a computer system
without the owner’s informed consent. Malware is in fact a generic term for all kinds of
computer threats. A simple classification of malware consists of file infectors and
stand-alone malware. Another way of classifying malware is based on their particular
action: worms, backdoors, trojans, rootkits, spyware, adware etc.
Malware detection is a challenging task due to the rapid evolution of malware and
the increasing sophistication of attacks. Traditional signature-based detection methods
are no longer sufficient to protect against new and emerging malware. Machine learning
(ML) offers a promising approach to malware detection, as it can be used to learn the
characteristics of malware and distinguish them from benign files [4].
ML-based malware detection systems typically extract features from malware and
benign files. These features can be based on a variety of factors, such as the file’s size,
entropy, opcode distribution, and API calls. Once the features have been extracted, an
ML algorithm is used to train a model that can classify files as malware or benign.
Once the model is trained, it can be used to scan new files for malware. If the model
predicts that a file is malware, it can be quarantined or deleted. ML-based malware
detection systems have several advantages over traditional signature-based methods.
First, ML systems can be more effective at detecting new and emerging malware, as
they do not rely on a database of known malware signatures. Second, ML systems
can be more adaptive to changes in malware, as they can be retrained on new data as
needed [4].
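
As a minimal sketch of the feature extraction step described above, the following Python computes two of the features mentioned, file size and byte entropy. The function names `byte_entropy` and `extract_features` are illustrative assumptions and are not taken from any cited system; opcode and API-call features would require a disassembler and are left out.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(data: bytes) -> dict:
    """Build a tiny feature dictionary from raw file contents (illustrative only)."""
    return {"size": len(data), "entropy": byte_entropy(data)}
```

A feature dictionary of this kind would then be turned into a numeric vector and passed to the training algorithm. Packed or encrypted payloads tend to have entropy close to 8 bits per byte, which is one reason entropy is a commonly used feature.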
However, ML-based malware detection systems also have some challenges. One
challenge is that ML systems can be computationally expensive to train and deploy.
Another challenge is that ML systems can be vulnerable to adversarial attacks, where
attackers craft malware specifically designed to fool ML systems.
Despite these challenges, ML-based malware detection systems are becoming in-
creasingly widely used. Many commercial antivirus products now use ML to detect
malware. Additionally, researchers are actively developing new ML-based malware
detection techniques.
Malware detection through standard, signature-based methods is becoming more and more
difficult, since much current malware uses multiple polymorphic layers to avoid
detection, or uses side mechanisms to update itself automatically to a newer version at
short intervals in order to evade antivirus software. For an example of dynamic file
analysis for malware detection, via emulation in a virtual environment, the interested
reader can see [5].


1.1 Scope and Significance

1.1.1 Scope:

The International Data Corporation states that Android shipments grew to 136 million
units, or 75 percent of the market, in 3Q 2012. Bloomberg Businessweek reports there are
700,000 applications available on Google Play as of October 29, 2012. While most of
these applications are benign, TrendLabs reports in their 3Q 2012 security roundup that
the top ten installed malicious Android applications had 71,520 installations in total.
A key open challenge is the lack of large-scale studies that use hundreds or thou-
sands of mobile malware samples to analyze the effectiveness of well-known machine-
learning algorithms. Studies that provide empirical results from large-scale experiments
are needed to focus research and understand the effectiveness of current algorithms.
Key challenges include acquiring thousands of malware samples, orchestrating the in-
stall→profile→clean cycle, and training the machine learning classifiers [5].

1.1.2 Significance:

To fill this gap in research on large-scale evaluations of machine learning algorithms
for mobile malware detection, we present results from studying 6 machine learning
classifiers on 1330 malicious and 408 benign applications, for a total of 1738 unique
applications. By sending a large amount of emulated user input to each application, we
collected 6,832 feature vectors.
We analyzed classifier performance with cross validation (i.e. analysis performed on
applications in the training set) and with true testing (i.e. analysis performed on
applications not in the training set). Although cross validation frequently suggests
better malware classifier performance than true testing (i.e. it is biased), it is used
in many dynamic malware papers and is therefore included here for comparison with prior
results. The empirical results provide an initial roadmap for researchers selecting
machine learning algorithms for mobile malware classification and a resource for
validating the improvements of new malware classification approaches. Our results show
that Logistic and Bayes net malware classifiers perform the worst and best respectively
at detecting malware. Moreover, our datasets, empirical results, and distributed
experimentation infrastructure are available in open-source form [5].
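
Since 6,832 feature vectors were collected from 1738 applications, each application contributes several vectors. One way to realise the "true testing" described above is to hold out whole applications, so that no test vector comes from an application seen in training. The sketch below illustrates that idea only; the function name `split_by_application`, the tuple layout and the 25 percent test fraction are assumptions, not details from [5].

```python
import random


def split_by_application(vectors, test_fraction=0.25, seed=0):
    """Split (app_id, features, label) tuples so that train and test share no app.

    Holding out whole applications avoids the optimistic bias that appears when
    vectors from one application land on both sides of the split.
    """
    apps = sorted({app_id for app_id, _, _ in vectors})
    rng = random.Random(seed)
    rng.shuffle(apps)
    n_test = max(1, int(len(apps) * test_fraction))
    test_apps = set(apps[:n_test])
    train = [v for v in vectors if v[0] not in test_apps]
    test = [v for v in vectors if v[0] in test_apps]
    return train, test
```

A classifier would be fitted on `train` only and scored on `test`; comparing that score with a cross-validation score on the training set gives a rough idea of how large the bias is.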


1.2 Definition and Classification of Malware


Malware, or malicious software, is any program or file that is intentionally harmful to a
computer, network or server.
There are three types of malware analysis: static, dynamic, and memory malware analysis.

1.2.1 Static Malware Analysis


Static analysis consists of examining the code or structure of the executable file
without executing it. This kind of analysis can confirm whether a file is malicious,
provide information about its functionality and can also be used to produce a simple set
of signatures. For instance, the most common method used to uniquely identify a malicious
program is hashing [6].
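
To illustrate hashing as an identifier, here is a minimal sketch using Python's standard `hashlib` module. The sample bytes and the `known_hashes` set are stand-ins invented for the example; a real scanner would read file contents from disk and consult a maintained database.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_known_sample(data: bytes, known_hashes: set) -> bool:
    """True if the digest of `data` appears in the set of known-bad digests."""
    return sha256_hex(data) in known_hashes


sample = b"example payload bytes"        # stand-in for a file's contents
known_hashes = {sha256_hex(sample)}      # hypothetical signature database
modified = sample + b"\x00"              # the same sample with one byte appended
```

Here `is_known_sample(sample, known_hashes)` is true while `is_known_sample(modified, known_hashes)` is false: a hash identifies one exact file, so any change to the bytes yields a different digest.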
Detecting or checking malicious code without executing it is called static malware
analysis. It is a signature-focused form of malware analysis. Features such as metadata
strings, code, and imported libraries are extracted from the dormant malware and used in
the feature extraction stage of machine-learning classification. The files analysed are
most often EXE files, DLLs, documents, assembly code, byte code, etc.; static features
are extracted from these file types as the output. For static malware analysis, tools
like PEiD, ssdeep, pafish, Yara, strings, IDA Pro, OllyDbg, OllyDump, and many more can
be used [7].

1.2.2 Dynamic Malware Analysis


Dynamic analysis runs the malware sample and analyses its activity on the system, which
helps to eliminate the malware or stop it from spreading to other systems. Dynamic
malware analysis is used to extract the dynamic features of the malicious code, with
tools including CWSandbox, Anubis, CAT, TRACKTRAK, etc.

1.2.3 Memory-damaging Malware Analysis


Examining malicious code after it has been executed is known as memory-damaging malware
analysis. Memory analysis features include shared resources, application programs,
hooking detection, network services, rootkit links, hidden objects, injected code, etc.
Memory analysis tools include Volatility, Pin tools, Valgrind, etc.
This survey aims to review and systematize existing literature to promote malware
analysis using machine learning techniques.


Figure 1.1: Types of Malware

1.3 Common Types of Malware


It is helpful to classify the problem so that malware methods and their rationale are
better understood. Depending on its function, malware may be divided into different
groups.
The types of malware are:
Virus: This is the simplest type of malware. It is any piece of software that is loaded
and launched without the user’s permission and that replicates itself or modifies other
software [7].
Worm: This form of malware is very much like a virus. The difference is that the
worm will propagate to other machines across the network [7].
Trojan: This malware class describes malware that appears to be legitimate software.
Social engineering is therefore the usual propagation vector for this class, convincing
people that they are downloading legitimate applications [7].
Adware: The only aim of this type of malware is to display advertisements on the
computer. Adware may also be viewed as a subset of spyware, and its aim is to create
revenue for its developers [7].
Spyware: As the name suggests, malware that spies on the user can be called spyware.
Typical spyware practices include tracking search history in order to deliver
personalised advertising and tracking activities in order to sell the data to third
parties [7].
Rootkit: Its functionality gives the intruder access to data with higher permissions
than are permitted, for example by granting administrative access to an unauthorized
user. Rootkits are always hidden and often go unnoticed, making them extremely
challenging to find and remove.
Backdoor: A backdoor is a kind of malware that gives attackers an additional way to
access the device. By itself it does no harm, but it provides attackers with a wider
attack surface. Backdoors are therefore rarely used on their own; they usually precede
other malware attacks.
Keylogger: This malware class aims at logging all keys pressed by the user and thus
storing data including passwords, bank card numbers, and other sensitive data.
Ransomware: This malware is meant to encrypt all the data on the computer and ask the
victim to transfer a certain amount of money to get the decryption key. Typically, a
ransomware-infected computer is “frozen”, so the user cannot access any files, and the
screen is used to display the attackers’ demands [7].

Chapter 2
Traditional Approaches to Malware Detection
Malware detection involves using techniques and tools to identify, block, alert, and re-
spond to malware threats. Basic malware detection techniques can help identify and
restrict known threats and include signature-based detection, checksumming, and ap-
plication allowlisting.

2.1 Signature-based Detection


The proposed work deals with the analysis and scanning of the mobile applications on
the device and presents an analysis report based on the type of risk associated with
each application.
Figure 2.1 shows the flow of execution of the signature generation and malware detection
process. It starts with code analysis of an application, which extracts useful
information such as the trace log, permissions and signature pattern from the .apk file
of the application; this information is used to classify malware [8].

• Signature-based analysis is also called static code analysis. It is a process of
debugging a program without executing its code.

• Static malware analysis can be carried out on different representations of a
program.

• The method and its tools quickly determine whether a file has malicious intent.

Despite the challenges, signature-based detection is still an important tool for protecting
computers from malware. It is often used in conjunction with other malware detection
methods, such as heuristic-based detection and behavioral analysis, to provide a more
comprehensive level of protection.
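
As a rough sketch of how a signature match can work, the Python below searches file contents for known byte patterns. The signature names and byte patterns are invented for illustration; they are not real malware signatures, and production scanners use far richer rule formats.

```python
# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES = {
    "Demo.Pattern.A": b"\xde\xad\xbe\xef",
    "Demo.Pattern.B": b"EVIL_MARKER",
}


def scan(data: bytes, signatures=SIGNATURES) -> list:
    """Return the sorted names of all signatures whose pattern occurs in `data`."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)
```

An empty result means only that no known pattern was found, which is why this approach cannot flag malware that has never been seen before.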

2.1.1 Benefits of signature-based detection:


• It is very effective at detecting known malware.

• It is relatively easy to implement and maintain.

• It is relatively inexpensive [8].


Figure 2.1: Work flow diagram of the proposed system

2.1.2 Challenges of signature-based detection:


• It can be easily bypassed by malware authors who modify their code slightly.

• It cannot detect malware that has not yet been seen before.

• It can be slow and computationally expensive to scan large files [8].

2.2 Heuristic-based Detection


In today’s digital landscape, malware remains a persistent and evolving threat.
Traditional signature-based antivirus solutions have limitations when it comes to
detecting new and rapidly mutating malware strains. Heuristic-based malware detection
has emerged as a proactive and dynamic approach to identifying malicious software [9].
Heuristic-based detection is a malware detection method that uses rules or algo-
rithms to identify malicious code. It does this by looking for suspicious patterns or
behaviors that are common in malware, but not in legitimate software [9].
Heuristic-based detection is often used in conjunction with signature-based de-
tection, which identifies malware by matching its code against a database of known
malware signatures. Signature-based detection is more effective at detecting known
malware, but it can be easily bypassed by malware authors who modify their code
slightly [10].
Heuristic-based detection can help to catch malware that has evaded signature-based
detection.


2.2.1 Techniques used in heuristic-based detection:


• File analysis: This involves examining the code of a file to look for suspicious
patterns, such as the use of known malicious functions or the presence of suspi-
cious strings of text [10].

• API call analysis: This involves tracking the calls that a program makes to system
APIs. Malware often makes calls to APIs that are not used by legitimate software [10].

• Process monitoring: This involves tracking the processes that are running on a
system. Malware often creates new processes or modifies existing processes in a
suspicious way [10].

• Network traffic analysis: This involves monitoring the network traffic that is
generated by a program. Malware often sends out network traffic to communicate
with other malware or to download additional malicious code [10].
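
The file analysis and API call analysis techniques above can be sketched as a simple weighted rule set. The API names below are real Windows API functions, but the weights, the string markers and the threshold of 5 are assumptions chosen purely for illustration.

```python
# Hypothetical weights: higher means more suspicious when present.
API_WEIGHTS = {"CreateRemoteThread": 3, "WriteProcessMemory": 3, "VirtualAllocEx": 2}
STRING_WEIGHTS = {"cmd.exe /c": 1, "powershell -enc": 2}


def heuristic_score(imported_apis, strings) -> int:
    """Sum the weights of suspicious APIs and string markers found in a sample."""
    score = sum(API_WEIGHTS.get(api, 0) for api in set(imported_apis))
    for text in strings:
        lowered = text.lower()
        score += sum(w for marker, w in STRING_WEIGHTS.items() if marker in lowered)
    return score


def is_suspicious(imported_apis, strings, threshold=5) -> bool:
    """Flag the sample when its heuristic score reaches the threshold."""
    return heuristic_score(imported_apis, strings) >= threshold
```

Because legitimate software can also use these APIs, a rule set like this trades missed detections against false positives through the choice of threshold.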

2.2.2 Benefits of heuristic-based detection:


• It can detect malware that has not yet been seen before.

• It can detect malware that has been modified to evade signature-based detection.

• It can be used to detect malware that is hidden in legitimate software [9].

2.2.3 Challenges of heuristic-based detection:


• It can be slow and computationally expensive.

• It can sometimes flag legitimate software as malware.

• It can be difficult to keep up with the ever-changing malware landscape [9].

2.3 Behavioral Analysis


Behavioral analysis is a malware detection method that analyzes the behavior of a pro-
gram to determine if it is malicious. This is in contrast to signature-based detection,
which identifies malware by matching its code against a database of known malware
signatures [11].
Behavioral analysis is often used in conjunction with signature-based detection to
provide a more comprehensive level of protection. Signature-based detection is good at
detecting known malware, but it can be easily bypassed by malware authors who modify
their code slightly. Behavioral analysis can help to catch malware that has evaded
signature-based detection [11].
There are a number of different techniques that can be used for behavioral analysis.

2.3.1 Some of the most common techniques for behavioral analysis


• Process monitoring: This involves tracking the processes that are running on a
system. Malware often creates new processes or modifies existing processes in a
way that is suspicious.

• Registry monitoring: This involves tracking the changes that are made to the
system registry. Malware often makes changes to the registry to install itself or to
modify its behavior.

• File monitoring: This involves tracking the files that are accessed or modified
by a program. Malware often accesses or modifies files that are not normally
accessed by legitimate software.

• Network traffic monitoring: This involves monitoring the network traffic that is
generated by a program. Malware often sends out network traffic to communicate
with other malware or to download additional malicious code [12].

Behavioral analysis is not perfect and can sometimes flag legitimate software as
malware. However, it is a valuable tool that can help to protect computers from a wide
range of malware threats.
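
One simple behavioral rule, given here only as an illustrative assumption, is to flag a program that modifies an unusually large number of files within a short time window. The sketch below implements that check with a sliding window over event timestamps; the function name and parameters are hypothetical.

```python
from collections import deque


def burst_detected(timestamps, max_events, window_seconds) -> bool:
    """True if more than `max_events` timestamps fall within any `window_seconds` span."""
    window = deque()
    for t in sorted(timestamps):
        window.append(t)
        # Drop events that are too old to share a window with the current one.
        while t - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_events:
            return True
    return False
```

The same pattern could be applied to registry writes or outgoing connections. A backup tool could also trigger it, which is the kind of false positive noted above.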

2.3.2 Benefits of Behavioral-based detection:


• It can detect malware that has not yet been seen before.

• It can detect malware that has been modified to evade signature-based detection.

• It can be used to detect malware that is hidden in legitimate software [12].

2.3.3 Challenges of Behavioral-based detection:


• It can be slow and computationally expensive.

• It can sometimes flag legitimate software as malware.

• It can be difficult to keep up with the ever-changing malware landscape [12].

Chapter 3
Introduction to Machine Learning
Machine learning is a kind of artificial intelligence that allows computers to learn from
data and make predictions. It uses algorithms to detect patterns and trends, allowing
systems to improve over time. Machine learning offers a wide range of applications,
ranging from predictive analytics to autonomous systems and natural language processing.

3.1 Basics of Machine Learning Algorithms


Machine learning is the process of programming computer systems to optimize a
performance criterion using example data or past experience. We have a model defined up
to some parameters, and learning is the execution of a computer program that optimizes
the parameters of the model using training data or past experience. The model may be
predictive, to make predictions in the future, or descriptive, to gain knowledge from
data, or both [7].
To fix ideas, it is useful to introduce the machine learning methodology as an
alternative to the conventional engineering approach to designing an algorithmic
solution. In the conventional approach, the problem of interest is studied in detail,
producing a mathematical model that captures the physics of the set-up under study.
Based on this model, an optimized algorithm is designed that offers performance
guarantees under the assumption that the physics-based model is an accurate
representation of reality [13].

3.2 Supervised, Unsupervised, and Reinforcement Learning

ML approaches fall into three categories, described next.
Supervised learning: In supervised learning, the AI model is trained on inputs together
with their labels, i.e. the expected output for each input. From these input-output
pairs, the model learns a mapping function, which it then uses to predict the labels of
future inputs [14].
Assume we need to create a model that distinguishes between a cat and a dog. To train
the model, we give it many photos of cats and dogs, each labelled with whether it is a
cat or a dog. The model attempts to learn a relationship between the input photos and
their labels. After training, the model can estimate whether an image is of a cat or a
dog, even if it has never seen the image before [14].
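
To make the idea of learning a mapping from labelled examples concrete, here is a minimal sketch of a nearest-centroid classifier, one of the simplest supervised methods. It is swapped in for illustration and is not a method described in [14]; the function names and the feature layout are assumptions.

```python
def train_centroids(samples):
    """Compute the mean feature vector per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))
```

Training amounts to calling `train_centroids` on labelled feature vectors, for example `[entropy, fraction of suspicious imports]` pairs tagged "malware" or "benign"; `predict` then labels a vector the model has never seen.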

Unsupervised learning: Unsupervised learning involves training the AI model solely on
the inputs, without labels. The model divides the input data into groups with similar
characteristics. A future input is then assigned to a group based on the similarity of
its features to those of the group [14].
Assume we have a collection of red and blue balls that we need to sort into two groups.
Assume that the only difference between the balls is their colour. The model attempts to
identify the traits that differ between the balls and uses them to divide the balls into
two classes. After the balls are divided according to their colour, we get two clusters
of balls, one blue and one red [14].
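
The ball-sorting example can be sketched as a two-cluster k-means on a single numeric feature. Representing colour as one hue value between 0 and 1 is an assumption made for the example, as is the function name `two_means_1d`.

```python
def two_means_1d(values, iterations=20):
    """Split numbers into two clusters by repeatedly assigning each to the nearer centre."""
    c0, c1 = min(values), max(values)      # initial cluster centres
    low, high = list(values), []
    for _ in range(iterations):
        low = [v for v in values if abs(v - c0) <= abs(v - c1)]
        high = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not high:                        # all values identical: one cluster only
            break
        c0, c1 = sum(low) / len(low), sum(high) / len(high)
    return low, high
```

No labels are supplied: the grouping emerges from the values alone, and naming the clusters "red" and "blue" afterwards is up to the analyst.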

Reinforcement learning: The AI model in reinforcement learning attempts to perform the
best possible action in a given situation in order to maximise its total reward. The
model learns by receiving feedback on its previous actions [14].
Consider the following scenario: a robot is asked to find a path between A and B.
Because it has no prior experience, the robot at first selects one of the available
pathways arbitrarily. The robot receives feedback on the pathway it chose and learns
from it. When the robot is in a similar position in the future, it can use this feedback
to make a better choice. For example, if the robot takes a particular pathway and
receives a reward, i.e., positive feedback, it learns that it should choose that pathway
again in order to maximise its reward [14].
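
The robot scenario can be sketched as an epsilon-greedy learner that keeps a running average of the reward for each pathway. This is one simple technique chosen for illustration; the pathway names, the fixed rewards and the parameter values are assumptions.

```python
import random


def learn_path_values(rewards, episodes=200, epsilon=0.1, seed=0):
    """Estimate the value of each pathway from reward feedback (epsilon-greedy)."""
    rng = random.Random(seed)
    paths = sorted(rewards)
    value = {p: 0.0 for p in paths}
    count = {p: 0 for p in paths}

    def take(path):
        # Update the running average reward for the chosen pathway.
        count[path] += 1
        value[path] += (rewards[path] - value[path]) / count[path]

    for path in paths:                      # no prior experience: try each once
        take(path)
    for _ in range(episodes):
        if rng.random() < epsilon:
            take(rng.choice(paths))         # explore a random pathway
        else:
            take(max(paths, key=value.get)) # exploit the best pathway so far
    return value
```

After learning, `max(values, key=values.get)` gives the pathway the robot would prefer. The small exploration rate lets it keep checking alternatives in case rewards change.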

3.3 Support Vector Machines (SVM)

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Although it can handle regression problems, it is best
suited to classification. The main objective of the SVM algorithm is to find the optimal
hyperplane in an N-dimensional space that separates the data points of different classes
in the feature space. The hyperplane is chosen so that the margin between the closest
points of different classes is as large as possible. The dimension of the hyperplane
depends on the number of features. If the number of input features is two, then the
hyperplane is just a line. If the number of input features is three, then the hyperplane
becomes a 2-D plane. It becomes difficult to visualise when the number of features
exceeds three [15].
Types of Support Vector Machine Algorithms


3.3.1 Linear SVM


Linear SVM is used when the data is linearly separable. Perfectly linearly separable
means that the data points can be classified into 2 classes by using a single straight
line (if 2D) [1].

3.3.2 Non-Linear SVM


Non-Linear SVM is used when the data is not linearly separable, that is, when the data
points cannot be separated into two classes by a straight line (in 2D). In that case a
technique known as the kernel trick is used to classify them. Most real-world
applications do not produce linearly separable data points, so the kernel trick is
widely used [1].
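As a rough illustration of the kernel trick, the sketch below uses a degree-2 polynomial kernel on XOR-style toy data. The data points, the feature map phi, and the function names are assumptions chosen for the example. It checks that the kernel value equals an inner product in a lifted 3-D space, and that the classes become separable there.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

# XOR-style toy data: no straight line in 2-D separates the two classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# The kernel gives the lifted-space inner product without ever building phi.
kernel_matches = all(
    math.isclose(poly_kernel(x, z), dot(phi(x), phi(z)), abs_tol=1e-9)
    for x in points
    for z in points
)

# In the lifted space the middle coordinate alone separates the classes.
lifted_predictions = [1 if phi(x)[1] > 0 else -1 for x in points]
print(kernel_matches, lifted_predictions == labels)
```

An SVM only needs inner products between samples, which is why replacing them with a kernel function is enough to obtain a non-linear decision boundary.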
Important Terms
Two terms recur throughout the discussion of SVMs:

• Support Vectors: These are the points that are closest to the hyperplane. The
separating line is defined with the help of these data points.

• Margin: This is the distance between the hyperplane and the observations closest to
the hyperplane (the support vectors).
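To make the two terms concrete, the following sketch computes distances to a hand-picked line in 2-D. The weights w, the offset b, and the four points are hypothetical values, and no SVM is actually trained here. The code only measures which observations lie closest to the given hyperplane.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical 2-D hyperplane (a line): x1 + x2 - 3 = 0
w, b = (1.0, 1.0), -3.0
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (4.0, 3.0)]

distances = {p: signed_distance(w, b, p) for p in points}
predicted = {p: 1 if d > 0 else -1 for p, d in distances.items()}

# The margin is the smallest absolute distance to the line; the points that
# attain it are the closest observations, i.e. the support vectors of this line.
margin = min(abs(d) for d in distances.values())
closest = [p for p, d in distances.items() if math.isclose(abs(d), margin)]
print(margin, closest)
```

A trained SVM would choose w and b so that this margin is as large as possible. Here the line is simply given.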

3.4 Ensemble Methods


Ensemble methods are a type of machine learning technique that combines the predictions
of multiple individual models to produce a more accurate and robust final prediction.
Ensemble methods are often used in applied work to improve the quality of the
predictions that are generated [16].

3.4.1 Types of Ensemble Methods


• Averaging methods: These methods combine the predictions of individual models by
taking a simple average. Averaging methods are typically used for regression tasks,
where the goal is to predict a continuous value [17].

• Voting methods: These methods combine the predictions of individual models by
taking a majority vote. Voting methods are typically used for classification tasks,
where the goal is to predict a discrete category [17].
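The two combination rules above can be sketched as follows. The function names and the sample model outputs are made up for illustration.

```python
from collections import Counter
from statistics import mean

def average_predictions(predictions):
    """Combine regression outputs by taking their simple average."""
    return mean(predictions)

def majority_vote(predictions):
    """Combine class labels by returning the most common one."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from three individual models for one sample.
averaged = average_predictions([0.5, 0.75, 1.0])
voted = majority_vote(["malware", "benign", "malware"])
print(averaged, voted)
```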
Some of the most popular ensemble methods include:


Figure 3.1: Support Vector Machines (SVM) [1]


• Bagging: Bagging creates multiple training datasets by randomly sampling from the
original training dataset with replacement. Each training dataset is then used to
train a separate model. The predictions of the individual models are then averaged to
produce the final prediction [18].

• Boosting: Boosting trains a sequence of models, where each model focuses on the
examples that were misclassified by the previous model. The predictions of the
individual models are then weighted and combined to produce the final prediction [18].

• Stacking: Stacking trains a meta-model to combine the predictions of individual
models. The meta-model is trained on the predictions of the individual models and the
target labels from the training dataset. The meta-model is then used to make
predictions on new data [18].
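A minimal sketch of bagging is given below, assuming a one-feature dataset and a very simple base model (a threshold "stump"). Both are hypothetical choices for illustration. Because this is a classification example, the individual predictions are combined by majority vote rather than by averaging.

```python
import random
from collections import Counter

def train_stump(samples):
    """Fit a 1-D threshold classifier: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(data, n_models=15, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_models)]

def bagging_predict(thresholds, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D training set: (feature value, label).
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
ensemble = bagging_fit(data)
print(bagging_predict(ensemble, 0.05), bagging_predict(ensemble, 0.95))
```

Each stump sees a slightly different resampled dataset, so the thresholds differ from model to model, and the vote smooths out those differences.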

3.4.2 Ensemble methods can be used to improve the performance of machine learning

• Reducing overfitting: Ensemble methods can help to reduce overfitting by combining
the predictions of multiple models that have been trained on different subsets of the
training data. This makes it less likely that the final prediction will be biased
towards any particular subset of the training data [19].

• Improving accuracy: Ensemble methods can often improve the accuracy of machine
learning models by combining the predictions of multiple models that have different
strengths and weaknesses. This is because the different models may be able to capture
different aspects of the data [19].

• Reducing variance: Ensemble methods can help to reduce the variance of machine
learning models by averaging the predictions of multiple models. This makes the final
prediction more stable and less likely to fluctuate wildly on new data [19].
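The variance-reduction effect can be checked with a small simulation. The sketch assumes each "model" is simply the true value plus independent Gaussian noise. This is an idealisation: real ensemble members are usually correlated, so the reduction in practice is smaller.

```python
import random
from statistics import pvariance

rng = random.Random(0)
TRUE_VALUE = 1.0

def noisy_model():
    """One hypothetical model: the true value plus independent Gaussian noise."""
    return TRUE_VALUE + rng.gauss(0.0, 1.0)

def averaged_ensemble(n_models):
    """Average the outputs of n_models independent noisy models."""
    outputs = [noisy_model() for _ in range(n_models)]
    return sum(outputs) / len(outputs)

# Repeat both predictors many times and compare how much their outputs spread.
single_runs = [noisy_model() for _ in range(2000)]
ensemble_runs = [averaged_ensemble(25) for _ in range(2000)]

single_variance = pvariance(single_runs)
ensemble_variance = pvariance(ensemble_runs)
print(single_variance, ensemble_variance)
```

With independent errors, averaging n models divides the variance by roughly n, which is why the ensemble output fluctuates far less than a single model.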


Figure 3.2: Proposed ensemble method for classification of both imbalanced and balanced
malaria disease datasets [2].

Chapter 4
Case Studies and Real-world Applications
Machine learning (ML) is increasingly being used to detect malware in real time. ML
models can be trained on large datasets of known malware and benign files to learn the
patterns that distinguish the two. Once trained, these models can be used to scan new
files and identify potential malware with high accuracy.
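The train-then-scan workflow can be sketched with a single static feature, the byte entropy of a file, and a nearest-centroid rule. Everything here is a toy assumption: the "files" are synthetic byte strings rather than real samples, the labels are invented, and high entropy on its own is not a reliable sign of malware.

```python
import math
import random
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())

def fit_centroids(samples):
    """Average the feature per label (a one-feature nearest-centroid model)."""
    grouped = {}
    for data, label in samples:
        grouped.setdefault(label, []).append(byte_entropy(data))
    return {label: sum(v) / len(v) for label, v in grouped.items()}

def predict(centroids, data: bytes) -> str:
    """Assign the label whose centroid is closest to the sample's entropy."""
    feature = byte_entropy(data)
    return min(centroids, key=lambda label: abs(centroids[label] - feature))

def random_bytes(rng, n):
    """Synthetic high-entropy content standing in for packed or encrypted data."""
    return bytes(rng.getrandbits(8) for _ in range(n))

# Synthetic stand-ins, NOT real files: repetitive bytes vs. random bytes.
rng = random.Random(0)
training = [
    (b"hello world " * 50, "benign"),
    (b"config=1\n" * 80, "benign"),
    (random_bytes(rng, 600), "suspicious"),
    (random_bytes(rng, 600), "suspicious"),
]
centroids = fit_centroids(training)
print(predict(centroids, b"plain text " * 40), predict(centroids, random_bytes(rng, 600)))
```

Production detectors use many more features and far larger labelled datasets, but the two phases are the same: fit on known samples, then score new files.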

4.1 Malware Detection in Antivirus Software


• Google Play Protect: Google Play Protect is a security service that scans apps on
Android devices for malware. It uses machine learning alongside other techniques to
detect malware, including anomaly detection, signature-based detection, and heuristic
analysis [20].

• Microsoft Defender Antivirus: Microsoft Defender Antivirus is security software that
protects Windows devices from malware and other threats. It uses machine learning to
detect malware in a variety of ways, including analyzing the behavior of files and
processes and detecting malicious code within files [20].

• CrowdStrike Falcon: CrowdStrike Falcon is a cloud-based security platform that
protects endpoints from malware and other threats. It uses machine learning to detect
malware in a variety of ways, including analyzing the behavior of files and processes
and detecting malicious code within files [20].

In addition to these commercial products, there are also a number of open source
machine learning tools that can be used for malware detection. For example, MalConv is
a convolutional neural network architecture, available in open source implementations,
that classifies executables directly from their raw bytes [21].

4.2 Mobile Malware Detection


• In 2017, Google Play Protect used machine learning to detect and remove a new
type of malware called Mazar from over 10 million Android devices [21].

• In 2018, Microsoft Defender Antivirus used machine learning to detect and block
a new type of malware called Emotet, which was targeting businesses and orga-
nizations around the world [21].


Figure 4.1: Proposed ML malware detection method [3]

• In 2019, CrowdStrike Falcon used machine learning to detect and block a new
type of malware called Doppelgängers, which was designed to impersonate legit-
imate software [21].

Machine learning has proven to be a very effective tool for malware detection. By
learning the patterns that distinguish malware from benign files, ML models can detect
new malware strains that traditional signature-based detection methods may miss [22].

4.2.1 Benefits of using machine learning for malware detection


There are a number of benefits to using machine learning for malware detection,
including:
Accuracy: Machine learning models can be trained to detect malware with very high
accuracy. This is because ML models can learn the patterns that distinguish malware
from benign files, even if these patterns are complex and subtle [22].
Speed: Machine learning models can scan files for malware very quickly. This matters
because new malware strains are being released all the time and must be detected
promptly [22].
Adaptability: Machine learning models can be adapted to detect new malware strains.
This is because ML models can learn from new data, and they can be updated regularly
to reflect the latest threats [22].


Overall, machine learning is a very effective tool for malware detection. It offers a
number of benefits over traditional signature-based detection methods, including accu-
racy, speed, and adaptability [23].

4.3 Limitations of Machine Learning in Malware Detection
Machine learning (ML) is a powerful tool for malware detection, but it has some
limitations. The most important are:
Adversarial attacks: Adversarial attacks are designed to fool machine learning models
into making incorrect predictions. They can be used to create malware that is
specifically crafted to evade detection by ML models [24].
Data requirements: ML models need to be trained on large datasets of malware and
benign files in order to be effective. This can be a challenge for organizations that
do not have access to such datasets [24].
Interpretability: ML models can be hard to interpret, so it is often unclear why they
make certain predictions. This makes it harder to trust the models and to identify and
fix problems with them [24].
Cost: Training and maintaining ML models can be expensive. This can be a barrier for
small businesses and organizations [24].
Despite these limitations, machine learning is still a valuable tool for malware
detection. ML models can detect new malware strains that traditional signature-based
detection methods may miss. Additionally, ML models can be used to improve the
accuracy and speed of malware detection [23].
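The adversarial-attack limitation can be illustrated on a tiny linear model. The sketch below applies one fast-gradient-sign step, the technique described in [25], to a logistic-regression score. The weights, the feature vector, and the step size are hypothetical. Real malware features usually cannot be changed this freely, because the file must still run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Probability that x is malicious under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """One fast-gradient-sign step: x + epsilon * sign(dLoss/dx).

    For logistic regression with cross-entropy loss, dLoss/dx = (p - y) * w.
    """
    p = score(w, b, x)
    return [xi + epsilon * sign((p - y) * wi) for xi, wi in zip(x, w)]

# Hypothetical model weights and a sample whose true label is malicious (y = 1).
w, b = [2.0, -1.0, 1.5], -1.0
x = [0.9, 0.2, 0.6]
x_adv = fgsm(w, b, x, y=1, epsilon=0.4)

original_score = score(w, b, x)
adversarial_score = score(w, b, x_adv)
print(original_score, adversarial_score)
```

The perturbed sample drops below a 0.5 decision threshold even though its true label has not changed. Adversarial training counters this by adding such perturbed samples, with their correct labels, to the training set.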

4.3.1 Things that can be done to reduce the limitations of machine learning
• Use a variety of detection methods: It is important to use a variety of malware
detection methods, including ML-based methods, signature-based methods, and heuristic
analysis. This will help to reduce the risk of malware evading detection [25].

• Use adversarial training: Adversarial training is a technique that can be used to
make ML models more resistant to adversarial attacks. Adversarial training involves
training the model on adversarial examples, which are examples that have been designed
to fool the model [25].

• Use explainable AI: Explainable AI (XAI) is a field of research that focuses on
developing ML models that are more interpretable. XAI techniques can be used to help
understand why ML models make certain predictions and to identify any problems with
them [26].

• Invest in ML expertise: It is important to invest in ML expertise in order to develop
and maintain effective ML models for malware detection. This will help to ensure that
the models are accurate and reliable [26].
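The first recommendation, combining several detection methods, might look like the layered check below. This is a sketch under loud assumptions: the hash set, the marker strings, the thresholds, and above all ml_score (a placeholder standing in for a trained model) are hypothetical and exist only to show the control flow.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad samples.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known-bad-sample").hexdigest()}

def signature_match(data: bytes) -> bool:
    """Signature-based layer: exact hash lookup."""
    return hashlib.sha256(data).hexdigest() in KNOWN_BAD_HASHES

def heuristic_flags(data: bytes) -> int:
    """Heuristic layer: count illustrative suspicious markers (made up here)."""
    markers = (b"CreateRemoteThread", b"VirtualAllocEx")
    return sum(marker in data for marker in markers)

def ml_score(data: bytes) -> float:
    """Placeholder for a trained model's malware probability (assumption)."""
    return min(1.0, len(set(data)) / 256)

def layered_verdict(data: bytes, ml_threshold: float = 0.9) -> str:
    """Flag a sample if any one of the three detection layers fires."""
    if signature_match(data):
        return "malicious (signature)"
    if heuristic_flags(data) >= 2:
        return "suspicious (heuristic)"
    if ml_score(data) >= ml_threshold:
        return "suspicious (ml)"
    return "clean"

print(layered_verdict(b"known-bad-sample"), layered_verdict(b"hello"))
```

Running the cheap, precise signature check first and the ML score last means a sample that evades one layer can still be caught by another.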

Chapter 5
Summary and Future scope
This report provides a comprehensive exploration of the critical domain of malware
detection, with a focus on the integration of machine learning techniques. It begins by
establishing a solid foundation in the understanding of malware, including its various
forms and types. Traditional detection methods, such as signature-based, heuristic-
based, and behavioral-based approaches, are examined in detail, shedding light on their
strengths and limitations.
The report then transitions to the realm of machine learning, introducing core con-
cepts like supervised and unsupervised learning, Support Vector Machines, and Ensem-
ble Methods. This knowledge equips readers with the necessary tools to appreciate how
machine learning can be harnessed for effective malware detection.
Chapter 4 presents compelling case studies and real-world applications, highlighting
the practicality and effectiveness of employing machine learning in the fight against
malware. These examples demonstrate the tangible impact of this technology in the
cybersecurity landscape.
In conclusion, this report underscores the significance of leveraging the power of
machine learning to address the ever-evolving threat landscape of malware. It serves
as an invaluable resource for security professionals, researchers, and anyone seeking
insights into the dynamic field of malware detection and the pivotal role played by
machine learning in enhancing cybersecurity. By exploring the challenges, benefits,
and real-world applications of machine learning in this context, this report equips its
readers with the knowledge needed to combat malware effectively in today’s digital
environment.

References
[1] O. Hamidi, J. Poorolajal, M. Sadeghifar, H. Abbasi, Z. Maryanaji, H. R. Faridi,
and L. Tapak, “A comparative study of support vector machines and artificial neural
networks for predicting precipitation in Iran,” Theoretical and Applied Climatology,
vol. 119, pp. 723–731, 2015.

[2] “ITU-T Rec. H.265 and ISO/IEC 23008-2: High Efficiency Video Coding
(HEVC),” ITU-T and ISO/IEC, 2013.

[3] M. S. Akhtar and T. Feng, “Malware analysis and detection using machine
learning algorithms,” Symmetry, vol. 14, no. 11, 2022. [Online]. Available:
https://www.mdpi.com/2073-8994/14/11/2304

[4] D. Ucci and L. Aniello, “Survey on the usage of machine learning techniques for
malware analysis,” Computers & Security, vol. 81, 10 2017.

[5] D. Gavriluţ, M. Cimpoeşu, D. Anton, and L. Ciortuz, “Malware detection using
machine learning,” in 2009 International Multiconference on Computer Science
and Information Technology, 2009, pp. 735–741.

[6] B. Amos, H. Turner, and J. White, “Applying machine learning classifiers to
dynamic Android malware detection at scale,” in 2013 9th International Wireless
Communications and Mobile Computing Conference (IWCMC), 2013, pp. 1666–1671.

[7] D. Gibert Llauradó, C. Mateu Piñol, and J. Planes Cid, “The rise of machine
learning for detection and classification of malware: Research developments, trends
and challenge,” Journal of Network and Computer Applications, vol. 153, 102526, 2020.

[8] N. Pachhala, S. Jothilakshmi, and B. P. Battula, “A comprehensive survey on
identification of malware types and malware classification using machine learning
techniques,” in 2021 2nd International Conference on Smart Electronics and
Communication (ICOSEC). IEEE, 2021, pp. 1207–1214.

[9] P. R. Pardhi, J. K. Rout, and N. K. Ray, “Implementation of a malware scanner
using signature-based approach for Android applications,” in 2021 19th OITS
International Conference on Information Technology (OCIT). IEEE, 2021, pp. 14–19.

[10] Ö. A. Aslan and R. Samet, “A comprehensive review on malware detection
approaches,” IEEE Access, vol. 8, pp. 6249–6271, 2020.


[11] L. Mendonça and H. Santos, “Botnets: a heuristic-based detection framework,” in
Proceedings of the Fifth International Conference on Security of Information and
Networks, 2012, pp. 33–40.

[12] T. Isohara, K. Takemori, and A. Kubota, “Kernel-based behavior analysis for
Android malware detection,” in 2011 Seventh International Conference on Computational
Intelligence and Security. IEEE, 2011, pp. 1011–1015.

[13] T.-E. Wei, C.-H. Mao, A. B. Jeng, H.-M. Lee, H.-T. Wang, and D.-J. Wu, “Android
malware detection via a latent network behavior analysis,” in 2012 IEEE 11th
International Conference on Trust, Security and Privacy in Computing and
Communications. IEEE, 2012, pp. 1251–1258.

[14] “Educative, Inc.: Supervised vs. unsupervised vs. reinforcement learning,”
[Visited on: 2023-10-20]. [Online]. Available:
https://www.educative.io/answers/supervised-vs-unsupervised-vs-reinforcement-learning

[15] M. Kaur, A. K. Shukla, and S. Kaur, “An introduction to machine learning in a
nutshell,” in 2021 10th International Conference on System Modeling & Advancement
in Research Trends (SMART). IEEE, 2021, pp. 17–22.

[16] S. Ardabili, A. Mosavi, and A. R. Várkonyi-Kóczy, “Advances in machine learning
modeling reviewing hybrid and ensemble methods,” in International Conference on
Global Research and Education. Springer, 2019, pp. 215–227.

[17] P. Kazienko, E. Lughofer, and B. Trawiński, “Hybrid and ensemble methods in
machine learning J.UCS special issue,” J Univers Comput Sci, vol. 19, no. 4, pp.
457–461, 2013.

[18] “Guide on support vector machine (SVM) algorithm,” [Visited on: 2023-09-05].
[Online]. Available:
https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/#

[19] T. G. Dietterich, “Ensemble methods in machine learning,” in International
Workshop on Multiple Classifier Systems. Springer, 2000, pp. 1–15.

[20] O. Sagi and L. Rokach, “Ensemble learning: A survey,” Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1249, 2018.

[21] M. S. Akhtar and T. Feng, “Iota based anomaly detection machine learning in
mobile sensing,” EAI Endorsed Transactions on Creative Technologies, vol. 9,
no. 30, pp. e1–e1, 2022.


[22] A. Demontis, M. Melis, B. Biggio, D. Maiorca, D. Arp, K. Rieck, I. Corona,
G. Giacinto, and F. Roli, “Yes, machine learning can be more secure! A case study
on Android malware detection,” IEEE Transactions on Dependable and Secure Computing,
vol. 16, no. 4, pp. 711–724, 2017.

[23] K. Liu, S. Xu, G. Xu, M. Zhang, D. Sun, and H. Liu, “A review of Android
malware detection approaches based on machine learning,” IEEE Access, vol. 8, pp.
124 579–124 607, 2020.

[24] A. Gaurav, B. B. Gupta, and P. K. Panigrahi, “A comprehensive survey on machine
learning approaches for malware detection in IoT-based enterprise information
system,” Enterprise Information Systems, vol. 17, no. 3, p. 2023764, 2023.

[25] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing
adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[26] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining
the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
