Seminar Report
On
Malware Detection Using Machine
Learning
Submitted to the Department of Electronics Engineering in Partial Fulfilment for the
Requirements for the Degree of
Bachelor of Technology
(Electronics and Communication)
by
Guided by
Dr J. N. PATEL
Associate Professor, DECE
CERTIFICATE
This is to certify that the SEMINAR REPORT entitled “Malware Detection Using
Machine Learning” is presented & submitted by Candidate MANISH KUMAR SINGH,
bearing Roll No. U21EC123, of B.Tech. III, 5th Semester in the partial
fulfillment of the requirement for the award of B.Tech. Degree in Electronics &
Communication Engineering for academic year 2023 - 24.
He has successfully and satisfactorily completed his Seminar Exam in all
respects. We certify that the work is comprehensive, complete and fit for evaluation.
1.
2.
Dr J. N. SARVAIYA
Head & Professor Seal of The Department
DECE, SVNIT (November 2023)
Dr. J. N. PATEL
Associate Professor & Seminar Guide
Acknowledgements
I would like to express my profound gratitude and deep regards to my guide Dr. J. N.
Patel for his guidance. I am heartily thankful for his suggestions and for the clarity he
brought to the concepts of the topic, which helped me a lot in this work.
I would also like to thank Dr. J. N. SARVAIYA, Head of the Electronics Engineering
Department, SVNIT and all the faculty members of ECED for their co-operation and
suggestions.
Finally, I would like to thank my classmates for their love, support and encourage-
ment during this challenging period. They have always been there for me and helped
me overcome any difficulties.
Nov 27, 2023
Abstract
This report delves into the critical domain of malware detection by exploring the in-
tegration of machine learning techniques. The study encompasses various aspects of
malware, their classification, and the evolution of traditional detection methods, as well
as an in-depth examination of machine learning approaches. The objective is to pro-
vide an extensive overview of the challenges, benefits, and real-world applications of
machine learning in the realm of malware detection.
Moving forward, the report examines the traditional approaches to malware de-
tection. It highlights signature-based, heuristic-based, and behavioral-based detection
methods, elucidating their strengths and limitations. By gaining insights into these
conventional methods, we set the stage for a comprehensive exploration of machine
learning.
Chapter 4 delves into case studies and real-world applications that showcase the
practicality and efficacy of applying machine learning to malware detection. These
real-world examples offer a glimpse into the tangible impact of this technology in the
field of cybersecurity.
The report concludes by summarizing the key takeaways from each chapter and
emphasizing the importance of harnessing the power of machine learning to combat
the evolving threat landscape of malware. This report serves as a valuable resource for
security professionals, researchers, and anyone interested in understanding the dynamic
landscape of malware detection and the pivotal role played by machine learning in this
field.
Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Chapters
1 INTRODUCTION
1.1 Scope and Significance
1.1.1 Scope:
1.1.2 Significance:
1.2 Definition and Classification of Malware
1.2.1 Static Malware Analysis
1.2.2 Dynamic Malware Analysis
1.2.3 Memory-damaging Malware Analysis
1.3 Common Types of Malware
2 Traditional Approaches to Malware Detection
2.1 Signature-based Detection
2.1.1 Benefits of signature-based detection:
2.1.2 Challenges of signature-based detection:
2.2 Heuristic-based Detection
2.2.1 Techniques used in heuristic-based detection:
2.2.2 Benefits of heuristic-based detection:
2.2.3 Challenges of heuristic-based detection:
2.3 Behavioral Analysis
2.3.1 Some of the most common techniques for behavioral analysis
2.3.2 Benefits of Behavioral-based detection:
2.3.3 Challenges of Behavioral-based detection:
3 Introduction to Machine Learning
3.1 Basics of Machine Learning Algorithms
3.2 Supervised, Unsupervised, and Reinforcement Learning
3.3 Support Vector Machines (SVM)
3.3.1 Linear SVM
3.3.2 Non-Linear SVM
3.4 Ensemble Methods
3.4.1 Types of Ensemble Methods
List of Figures
List of Abbreviations
ML Machine Learning
AI Artificial Intelligence
API Application Programming Interface
DLL Dynamic Link Library
IDA Interactive Disassembler
CAT Computer-Aided Translation
APK Android Application Package
SVM Support Vector Machine
2D Two-Dimensional
XAI Explainable Artificial Intelligence
Chapter 1
INTRODUCTION
Malware is software designed to infiltrate or damage a computer system without the
owner’s informed consent. The term is generic and covers all kinds of computer
threats. A simple classification divides malware into file infectors and stand-alone
malware. Another classification is based on the particular action performed: worms,
backdoors, trojans, rootkits, spyware, adware, etc.
Malware detection is a challenging task due to the rapid evolution of malware and
the increasing sophistication of attacks. Traditional signature-based detection methods
are no longer sufficient to protect against new and emerging malware. Machine learning
(ML) offers a promising approach to malware detection, as it can be used to learn the
characteristics of malware and distinguish them from benign files [4].
ML-based malware detection systems typically extract features from malware and
benign files. These features can be based on a variety of factors, such as the file’s size,
entropy, opcode distribution, and API calls. Once the features have been extracted, an
ML algorithm is used to train a model that can classify files as malware or benign.
Once the model is trained, it can be used to scan new files for malware. If the model
predicts that a file is malware, it can be quarantined or deleted. ML-based malware
detection systems have several advantages over traditional signature-based methods.
First, ML systems can be more effective at detecting new and emerging malware, as
they do not rely on a database of known malware signatures. Second, ML systems
can be more adaptive to changes in malware, as they can be retrained on new data as
needed [4].
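The feature-extraction step described above can be sketched in a few lines. This is a minimal illustration only, assuming the raw bytes of a file are already in memory; the function names and the `printable_ratio` feature are choices made for this sketch and are not taken from the cited work.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(data).values())


def extract_features(data: bytes) -> dict:
    """Turn raw file bytes into a small, fixed set of numeric features."""
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        # Fraction of bytes that are printable ASCII (illustrative feature).
        "printable_ratio": (
            sum(32 <= b < 127 for b in data) / len(data) if data else 0.0
        ),
    }


uniform = bytes(range(256))   # every byte value exactly once
repetitive = b"A" * 256       # a single repeated byte
print(extract_features(uniform))
print(extract_features(repetitive))
```

Feature dictionaries like these would then be converted into vectors and passed to the training algorithm; opcode and API-call features need a disassembler or a sandbox and are left out of the sketch.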
However, ML-based malware detection systems also have some challenges. One
challenge is that ML systems can be computationally expensive to train and deploy.
Another challenge is that ML systems can be vulnerable to adversarial attacks, where
attackers craft malware specifically designed to fool ML systems.
Despite these challenges, ML-based malware detection systems are becoming in-
creasingly widely used. Many commercial antivirus products now use ML to detect
malware. Additionally, researchers are actively developing new ML-based malware
detection techniques.
Malware detection through standard signature-based methods is becoming increasingly
difficult, since much current malware uses multiple polymorphic layers to avoid
detection, or side mechanisms that automatically update it to a newer version at short
intervals so that antivirus software cannot keep up. For an example of dynamic file
analysis for malware detection, via emulation in a virtual environment, the interested
reader can see [5].
1.1.1 Scope:
The International Data Corporation states that Android shipments grew to 136 million
units, or 75 percent of the market, in 3Q 2012. Bloomberg Businessweek reports that
700,000 applications were available on Google Play as of October 29, 2012. While most
of these applications are benign, TrendLabs reports in its 3Q 2012 security roundup
that the top ten installed malicious Android applications had 71,520 installations in
total.
A key open challenge is the lack of large-scale studies that use hundreds or thou-
sands of mobile malware samples to analyze the effectiveness of well-known machine-
learning algorithms. Studies that provide empirical results from large-scale experiments
are needed to focus research and understand the effectiveness of current algorithms.
Key challenges include acquiring thousands of malware samples, orchestrating the in-
stall→profile→clean cycle, and training the machine learning classifiers [5].
1.1.2 Significance:
1.2. Definition and Classification of Malware
1.3. Common Types of Malware
Chapter 2
Traditional Approaches to Malware Detection
Malware detection involves using techniques and tools to identify, block, alert, and re-
spond to malware threats. Basic malware detection techniques can help identify and
restrict known threats and include signature-based detection, checksumming, and ap-
plication allowlisting.
Despite the challenges, signature-based detection is still an important tool for protecting
computers from malware. It is often used in conjunction with other malware detection
methods, such as heuristic-based detection and behavioral analysis, to provide a more
comprehensive level of protection.
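In its simplest form, signature-based detection is a lookup of a file digest in a database of known-bad digests. The sketch below assumes SHA-256 digests as signatures and uses a harmless placeholder byte string in place of a real sample; `SIGNATURES` and `is_known_malware` are hypothetical names used only for illustration.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious files.
# The "malicious" sample here is a harmless placeholder byte string.
KNOWN_BAD_SAMPLE = b"placeholder bytes standing in for a known malicious file"
SIGNATURES = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_known_malware(data: bytes, signatures: set) -> bool:
    """Return True if the file's SHA-256 digest matches a stored signature."""
    return hashlib.sha256(data).hexdigest() in signatures


# An exact copy of the known sample, then the same sample with one byte appended.
print(is_known_malware(KNOWN_BAD_SAMPLE, SIGNATURES))
print(is_known_malware(KNOWN_BAD_SAMPLE + b"\x00", SIGNATURES))
```

Appending a single byte changes the digest, so the modified file no longer matches. This is one reason signature matching is usually combined with the heuristic and behavioral methods mentioned above.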
• It cannot detect malware that has not yet been seen before.
2.3. Behavioral Analysis
• API call analysis: This involves tracking the calls that a program makes to
system APIs. Malware often makes calls to APIs that are rarely used by legitimate
software [10].
• Process monitoring: This involves tracking the processes that are running on a
system. Malware often creates new processes or modifies existing processes in a
suspicious way [10].
• Network traffic analysis: This involves monitoring the network traffic that is
generated by a program. Malware often sends out network traffic to communicate
with other malware or to download additional malicious code [10].
• It can detect malware that has been modified to evade signature-based detection.
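The API call analysis described in the list above can be illustrated with a simple baseline comparison. This is a sketch under stated assumptions: the traces are hand-written lists of Windows API names, the benign baseline and the threshold of two calls are arbitrary, and a practical system would record traces in a sandbox.

```python
from collections import Counter

# Hypothetical API-call traces recorded from software assumed to be benign.
BENIGN_TRACES = [
    ["CreateFileW", "ReadFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["RegOpenKeyExW", "RegQueryValueExW", "RegCloseKey"],
]


def build_baseline(traces):
    """Collect every API name that appears in the benign traces."""
    return {call for trace in traces for call in trace}


def unusual_calls(trace, baseline):
    """Count calls in a trace that never occur in the benign baseline."""
    return Counter(call for call in trace if call not in baseline)


def is_suspicious(trace, baseline, threshold=2):
    """Flag a trace when it makes at least `threshold` out-of-baseline calls."""
    return sum(unusual_calls(trace, baseline).values()) >= threshold


baseline = build_baseline(BENIGN_TRACES)
trace = ["CreateFileW", "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"]
print(unusual_calls(trace, baseline))
print(is_suspicious(trace, baseline))
```

A fixed threshold like this is what makes heuristic rules prone to false positives: legitimate software that happens to use uncommon APIs would be flagged as well.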
modify their code slightly. Behavioral analysis can help to catch malware that has
evaded signature-based detection [11].
There are a number of different techniques that can be used for behavioral analysis.
• Registry monitoring: This involves tracking the changes that are made to the
system registry. Malware often makes changes to the registry to install itself or to
modify its behavior.
• File monitoring: This involves tracking the files that are accessed or modified
by a program. Malware often accesses or modifies files that are not normally
accessed by legitimate software.
• Network traffic monitoring: This involves monitoring the network traffic that is
generated by a program. Malware often sends out network traffic to communicate
with other malware or to download additional malicious code [12].
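The three monitoring techniques listed above all produce a stream of observed events, which can then be scored. The following is a minimal sketch of that idea, assuming events have already been collected as (category, action) pairs; the rule names, weights, and threshold are invented for illustration and do not come from the cited papers.

```python
# Hypothetical rule weights; a real system would tune or learn these.
RULES = {
    ("registry", "set_autorun_key"): 3,
    ("file", "modify_system_file"): 3,
    ("file", "create_temp_file"): 1,
    ("network", "connect_unknown_host"): 2,
}


def behavior_score(events):
    """Sum the weights of all observed (category, action) events."""
    return sum(RULES.get(event, 0) for event in events)


def classify(events, threshold=5):
    """Label a run 'suspicious' when its score reaches the threshold."""
    return "suspicious" if behavior_score(events) >= threshold else "benign"


installer = [("file", "create_temp_file"), ("registry", "read_key")]
dropper = [
    ("file", "create_temp_file"),
    ("registry", "set_autorun_key"),
    ("network", "connect_unknown_host"),
]
print(classify(installer), classify(dropper))
```

Because the score depends on what a program does rather than on its bytes, a sample whose code has been slightly modified still triggers the same rules.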
Behavioral analysis is not perfect and can sometimes flag legitimate software as
malware. However, it is a valuable tool that can help to protect computers from a wide
range of malware threats.
• It can detect malware that has been modified to evade signature-based detection.
Chapter 3
Introduction to Machine Learning
Machine learning is a branch of artificial intelligence that allows computers to learn
from data and make predictions. It uses algorithms to detect patterns and trends,
allowing systems to improve over time. Machine learning has a wide range of
applications, from predictive analytics to autonomous systems and natural language
processing.
and their labels. After training, the model can estimate whether an image is of a cat or
a dog, even if the algorithm has never seen the image before [14].
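The same supervised workflow (labeled examples in, a prediction for an unseen sample out) can be shown with a one-nearest-neighbor classifier. The two features and all numeric values below are made-up toy data, and nearest-neighbor is chosen here only because it is short; it is not the method used in the cited source.

```python
import math


def nearest_neighbor_predict(train_x, train_y, sample):
    """Return the label of the training point closest to `sample`."""
    distances = [math.dist(x, sample) for x in train_x]
    return train_y[distances.index(min(distances))]


# Toy labeled data: (file entropy, fraction of unusual API calls).
train_x = [(7.8, 0.60), (7.5, 0.45), (4.2, 0.05), (5.0, 0.10)]
train_y = ["malware", "malware", "benign", "benign"]

# Samples that were not part of the training data.
print(nearest_neighbor_predict(train_x, train_y, (7.6, 0.50)))
print(nearest_neighbor_predict(train_x, train_y, (4.5, 0.08)))
```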
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression, although it is best suited to classification. The
main objective of the SVM algorithm is to find the optimal hyperplane in an
N-dimensional feature space that separates the data points of different classes. The
hyperplane is chosen so that the margin between it and the closest points of each class
is as large as possible. The dimension of the hyperplane depends on the number of
features: with two input features the hyperplane is a line, and with three input
features it is a 2-D plane. It becomes difficult to visualize when the number of
features exceeds three [15].
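To make the idea of a separating hyperplane concrete, the sketch below fits a linear SVM in two dimensions by batch subgradient descent on the regularized hinge loss. This is one common way to train a linear SVM and is an assumption of this example, not the procedure described in [15]; the data points, learning rate, regularization strength, and epoch count are all illustrative.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=200):
    """Fit w, b by batch subgradient descent on the regularized hinge loss."""
    dim = len(points[0])
    n = len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]   # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify by which side of the hyperplane w.x + b = 0 the point lies on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data with labels +1 and -1.
points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(w, b, [predict(w, b, x) for x in points])
```

With two features the learned hyperplane is the line w[0]*x1 + w[1]*x2 + b = 0, matching the two-feature case in the paragraph above.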
Types of Support Vector Machine Algorithms
• Support Vectors: These are the points that are closest to the hyperplane. A
separating line will be defined with the help of these data points.
• Margin: This is the distance between the hyperplane and the observations closest to
the hyperplane (the support vectors).
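Both definitions can be checked numerically once a hyperplane is given. The sketch below assumes the hyperplane x1 + x2 = 0 and a small toy dataset; it computes each point's geometric distance to the hyperplane, takes the minimum as the margin, and reports the points at that distance as the support vectors.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def support_vectors(w, b, points, tol=1e-9):
    """Return the margin and the points that lie at that minimum distance."""
    distances = [distance_to_hyperplane(w, b, x) for x in points]
    margin = min(distances)
    closest = [x for x, d in zip(points, distances) if d - margin <= tol]
    return margin, closest


# Assumed hyperplane x1 + x2 = 0 and a toy two-class dataset.
w, b = (1.0, 1.0), 0.0
points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
margin, closest = support_vectors(w, b, points)
print(margin, closest)
```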
3.4. Ensemble Methods
• Improving accuracy: Ensemble methods can often improve the accuracy of ma-
chine learning models by combining the predictions of multiple models that have
different strengths and weaknesses. This is because the different models may be
able to capture different aspects of the data [19].
• Reducing variance: Ensemble methods can help to reduce the variance of ma-
chine learning models by averaging the predictions of multiple models. This
makes the final prediction more stable and less likely to fluctuate wildly on new
data [19].
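A majority-vote ensemble is one simple way to combine models with different strengths. In the sketch below the three base classifiers are hypothetical single-feature rules with arbitrary thresholds; in practice each would be a trained model, and the combination rule could also be averaging or weighted voting.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most base models."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, sample):
    """Ask every base model for a label and combine them by majority vote."""
    return majority_vote([model(sample) for model in models])


# Three hypothetical base classifiers, each looking at a different feature.
def by_entropy(f):
    return "malware" if f["entropy"] > 7.0 else "benign"


def by_api_calls(f):
    return "malware" if f["unusual_api_calls"] >= 2 else "benign"


def by_size(f):
    return "malware" if f["size"] < 50_000 else "benign"


models = [by_entropy, by_api_calls, by_size]
sample = {"entropy": 7.4, "unusual_api_calls": 3, "size": 120_000}
print(ensemble_predict(models, sample))
```

Here the size-based rule disagrees with the other two, and the vote overrides it. Using an odd number of base models avoids ties in a two-class problem.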
Figure 3.2: Proposed ensemble method for classification of both imbalanced and balanced
malaria disease datasets [2].
Chapter 4
Case Studies and Real-world Applications
Machine learning (ML) is increasingly being used to detect malware in real time. ML
models can be trained on large datasets of known malware and benign files to learn the
patterns that distinguish the two. Once trained, these models can be used to scan new
files and identify potential malware with high accuracy.
In addition to these commercial products, there are also a number of open source
machine learning tools that can be used for malware detection. For example, MalConv is
a convolutional neural network that classifies Windows executables directly from their
raw bytes, and open source implementations of it are publicly available [21].
• In 2018, Microsoft Defender Antivirus used machine learning to detect and block
an outbreak of Emotet, a trojan that was targeting businesses and organizations
around the world [21].
• In 2019, CrowdStrike Falcon used machine learning to detect and block a new
type of malware called Doppelgängers, which was designed to impersonate legit-
imate software [21].
Machine learning has proven to be a very effective tool for malware detection. By
learning the patterns that distinguish malware from benign files, ML models can detect
new malware strains that traditional signature-based detection methods may miss [22].
4.3. Limitations of Machine Learning in Malware Detection
Chapter 5
Summary and Future scope
This report provides a comprehensive exploration of the critical domain of malware
detection, with a focus on the integration of machine learning techniques. It begins by
establishing a solid foundation in the understanding of malware, including its various
forms and types. Traditional detection methods, such as signature-based, heuristic-
based, and behavioral-based approaches, are examined in detail, shedding light on their
strengths and limitations.
The report then transitions to the realm of machine learning, introducing core con-
cepts like supervised and unsupervised learning, Support Vector Machines, and Ensem-
ble Methods. This knowledge equips readers with the necessary tools to appreciate how
machine learning can be harnessed for effective malware detection.
Chapter 4 presents compelling case studies and real-world applications, highlighting
the practicality and effectiveness of employing machine learning in the fight against
malware. These examples demonstrate the tangible impact of this technology in the
cybersecurity landscape.
In conclusion, this report underscores the significance of leveraging the power of
machine learning to address the ever-evolving threat landscape of malware. It serves
as an invaluable resource for security professionals, researchers, and anyone seeking
insights into the dynamic field of malware detection and the pivotal role played by
machine learning in enhancing cybersecurity. By exploring the challenges, benefits,
and real-world applications of machine learning in this context, this report equips its
readers with the knowledge needed to combat malware effectively in today’s digital
environment.
References
[1] O. Hamidi, J. Poorolajal, M. Sadeghifar, H. Abbasi, Z. Maryanaji, H. R. Faridi,
and L. Tapak, “A comparative study of support vector machines and artificial neural
networks for predicting precipitation in Iran,” Theoretical and Applied Climatology,
vol. 119, pp. 723–731, 2015.
[2] “ITU-T Rec. H.265 and ISO/IEC 23008-2: High Efficiency Video Coding
(HEVC),” ITU-T and ISO/IEC, 2013.
[3] M. S. Akhtar and T. Feng, “Malware analysis and detection using machine
learning algorithms,” Symmetry, vol. 14, no. 11, 2022. [Online]. Available:
https://www.mdpi.com/2073-8994/14/11/2304
[4] D. Ucci and L. Aniello, “Survey on the usage of machine learning techniques for
malware analysis,” Computers & Security, vol. 81, 10 2017.
[6] B. Amos, H. Turner, and J. White, “Applying machine learning classifiers to dy-
namic android malware detection at scale,” in 2013 9th International Wireless
Communications and Mobile Computing Conference (IWCMC), 2013, pp. 1666–
1671.
[7] D. Gibert Llauradó, C. Mateu Piñol, and J. Planes Cid, “The rise of machine
learning for detection and classification of malware: Research developments, trends
and challenges,” Journal of Network and Computer Applications, vol. 153, 102526,
2020.
[10] Ö. A. Aslan and R. Samet, “A comprehensive review on malware detection
approaches,” IEEE Access, vol. 8, pp. 6249–6271, 2020.
[12] T. Isohara, K. Takemori, and A. Kubota, “Kernel-based behavior analysis for an-
droid malware detection,” in 2011 seventh international conference on computa-
tional intelligence and security. IEEE, 2011, pp. 1011–1015.
[13] T.-E. Wei, C.-H. Mao, A. B. Jeng, H.-M. Lee, H.-T. Wang, and D.-J. Wu, “An-
droid malware detection via a latent network behavior analysis,” in 2012 IEEE
11th international conference on trust, security and privacy in computing and
communications. IEEE, 2012, pp. 1251–1258.
[18] “Guide on support vector machine (svm) algorithm,” [Visited on: 2023-09-05].
[Online]. Available:
https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/#
[21] M. S. Akhtar and T. Feng, “Iota based anomaly detection machine learning in
mobile sensing,” EAI Endorsed Transactions on Creative Technologies, vol. 9,
no. 30, pp. e1–e1, 2022.
[23] K. Liu, S. Xu, G. Xu, M. Zhang, D. Sun, and H. Liu, “A review of android mal-
ware detection approaches based on machine learning,” IEEE Access, vol. 8, pp.
124 579–124 607, 2020.
[26] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’ Explaining
the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.