
Received 29 December 2023, accepted 18 January 2024, date of publication 23 January 2024, date of current version 30 January 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3357620

PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

G. M. SAKHAWAT HOSSAIN 1,2, KAUSHIK DEB 1, HELGE JANICKE 3,4, AND IQBAL H. SARKER 3,4, (Member, IEEE)

1 Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram 4349, Bangladesh
2 Department of Computer Science and Engineering, Rangamati Science and Technology University, Chattogram 4500, Bangladesh
3 Cyber Security Cooperative Research Centre, Joondalup, WA 6027, Australia
4 Security Research Institute, School of Science, Edith Cowan University, Perth, WA 6027, Australia

Corresponding authors: Kaushik Deb ([email protected]) and Iqbal H. Sarker ([email protected])


This work was supported by ECU Security Research Institute, School of Science, Edith Cowan University (ECU), Australia.

ABSTRACT The Portable Document Format (PDF) is one of the most widely used file types, thus fraudsters
insert harmful code into victims’ PDF documents to compromise their equipment. Conventional solutions
and identification techniques are often insufficient and may only partially prevent PDF malware because of
their versatile character and excessive dependence on a certain typical feature set. The primary goal of this
work is to detect PDF malware efficiently in order to alleviate the current difficulties. To accomplish the goal,
we first develop a comprehensive dataset of 15958 PDF samples taking into account the non-malevolent,
malicious, and evasive behaviors of the PDF samples. Using three well-known PDF analysis tools (PDFiD,
PDFINFO, and PDF-PARSER), we extract significant characteristics from the PDF samples of our newly
created dataset. In addition, we generate a number of derivations of features that have been experimentally
proven to be helpful in classifying PDF malware. We develop a method to build an efficient and explicable
feature set through the proper empirical analysis of the extracted and derived features. We explore different
baseline machine learning classifiers and demonstrate an accuracy improvement of approx. 2% for the
Random Forest classifier utilizing the selected feature set. Furthermore, we demonstrate the model’s
explainability by creating a decision tree that generates rules for human interpretation. Eventually, we make
a comparison with previous studies and point out some important findings.

INDEX TERMS Cybersecurity, PDF malware, data analytics, machine learning, decision rule, explainable
AI, human interpretation.

The associate editor coordinating the review of this manuscript and approving it for publication was Mahdi Zareei.

I. INTRODUCTION
In today's digital world, the majority of our tasks are associated with the use of the global web, making it increasingly essential to protect our data, information, and applications in the face of a variety of cyber criminals who continually attempt to construct brand-new illicit programs and strikes to harm the facilities [1]. Despite ever-increasing security improvements over time, PDF remains a favorite breach vector for adversaries to distribute malware and launch their attack activities [2]. There are several possible damaging acts perpetrated by PDF malware, including the creation of backdoors, password theft, spyware deployment, internet browser compromise, data spilling, social engineering, and scams. Therefore, one of the biggest hurdles in the modern world is the identification of PDF malware, because attackers generate many kinds of such malware and, additionally, its traits are changing swiftly on a daily basis. There are primarily two methods for identifying malware: behavior-based detection and signature-based detection. The features of the underlying object are used to establish a unique signature in the signature-based approach. The method effectively detects the existence of such a signature by inspecting the object.

On the other hand, using machine intelligence and other techniques, the behavior-based strategy can recognize unidentified and sophisticated malware to some extent, although it is a complex process.

The fundamental structure of a PDF, which contains a header, body, xref (cross reference) table, and trailer, is illustrated in Fig. 1 [3]. The PDF's header indicates which version of the parser format will be used. Text blocks, typefaces, file-specific metadata, and images are all included in the PDF's body, which also specifies its content [4]. There are four categories into which the contents of a PDF can be placed: numbers, strings, streams, and booleans [5]. Each item in the PDF file has an entry in the cross-reference table that details its byte offset or placement in the file as well as enables speedy random access to particular objects, facilitating effective document exploration and content retrieval. A PDF reader or parser may traverse and access the different items within the file by using the trailer, which gives them all the necessary information such as the PDF size, root object, metadata info, encryption info, and unique identifier of the PDF file. PDF malware can usually be created by injecting malicious content or programs into the elements of the fundamental structure of a PDF.

FIGURE 1. A sample structure of PDF.

PDF malware can be analyzed using a static, dynamic, or hybrid approach [6]. The static technique inspects malware refraining from executing the program that it embeds, but the dynamic method inspects malware by executing its code [7], [8]. Static analysis becomes susceptible when extensive evasion and fraudulent methods are used to disguise harmful execution behavior. In the present cybersecurity circumstances, depending solely on static inspection is often inadequate since a perpetrator who is dedicated to their attack would disguise and encode their code, making it normally invisible to static inspection [4]. Dynamic techniques, on the contrary, are more resilient to code deception, causing them to be a better defense against advanced viruses [9]. Dynamic analysis is often slow and challenging, but static analysis tends to be fast. Integrating the two approaches results in hybrid analysis, which is more effective in combating advanced malware than either method alone but additionally consumes a longer period and necessitates an additional complex analysis method [10].

Current malware identification methods frequently choose feature sets according to findings from a manual inspection of harmful PDF files and are guided by the expertise of the specialist. The chosen features, nevertheless, are occasionally exclusive for fraudulent files, luring adversaries to possibly gain authority over how and what a malicious file looks like, and evading the current detectors (while preserving their malicious properties). For instance, the Mimicry [11] and Reverse Mimicry [12] incidents have been exacerbated by the observation that the program builders infrequently disclose comprehensive information about the measures adopted to maintain their integrity and resistance to risks. In addition, the data that is accessible to developers, such as clean and harmful samples, datasets, vulnerabilities, payloads, and attack vectors employed within, also constrains their work. Such situations result in the produced solutions becoming outdated considerably earlier than what the diligent developers had planned.

Machine learning applications have advanced to the point where they can now protect systems from threats or aid forensic professionals in their investigations by spotting likely malicious PDF files [13]. However, adversarial techniques have grown capable of compromising threat document analyzers. Numerous machine-learning-based detection tools are at risk because their identification of well-crafted evasive scenarios may be erroneous [14], [15]. Various evaluations or detection methods have been created to detect specific incidents, but the immediate threat posed by evasive attacks has not yet been mitigated.

Developing feature engineering improvements integrating the adversarial behaviors of malicious PDFs for creating harmful PDF classifiers is challenging, yet necessary, and has an opportunity to have a significant impact in the field. We look at ways to improve the identification approach for PDF malware by 1) introducing an inclusive dataset that contains evasive characteristics of suspicious PDFs along with clean and harmful PDF samples, 2) extracting the features of the PDF samples, and 3) merging the most significant features to develop an effective feature set that can be fed into a classifier to produce a higher level of accuracy. We provide a comprehensive analysis of the most significant features identified for PDF malware detection and interpret the classifier's performance. In summary, our contributions can be outlined as follows:
• We have developed a comprehensive dataset that consists of a total of 15958 PDF samples including 7500 clean PDFs, 7666 malicious PDFs, and 792 evasive PDFs by considering the non-malicious, malicious, and evasive natures of the PDF samples. For this, we use three popular PDF analysis tools viz. PDFiD [16], PDFINFO [17], and PDF-PARSER [18].
• We develop a method to build an explicable feature set by taking into account the feature's characteristics and importance score.


• We have designed an architecture for malicious PDF detection and explored different machine learning classifiers to analyze and compare their efficacy in different cases.
• We have demonstrated the model's explainability by creating a decision tree that generates rules for human interpretation.
• We have conducted a wide range of experimental analyses and compared our results to previous studies. We also highlight some key observations of our study.

The rest of the paper is organized as follows: Section II provides an in-depth and organized overview of current research in the same field of study. Section III describes the recommended approach for PDF malware detection. Section IV presents and evaluates our findings, as well as the experimental outcomes. Section V provides an in-depth discussion while pointing out a few observations. Finally, closing remarks are offered in Section VI.

II. RELATED WORKS
In recent years, PDFs have been widely used to disseminate malicious documents and malware. To mitigate the subsequent and crucial growth of malicious PDF developments, numerous effective studies on detecting and categorizing technologies for malware and other dangerous files were established [19]. The tools that have been designed throughout the past years range greatly from being general and straightforward to specific and complex. Certain techniques try to find differences by scanning the whole file [20]. An additional kind of technique searches an intended file for resemblance to typical trends found in harmful PDF files [21], [22], [23], [24], [25]. Another set of tools concentrated on extracting, analyzing, and identifying attack methods, for instance, detecting JavaScript-based attacks [26], [27], [28], [29], [30], [31], [32]. Most of these approaches are heavily reliant on machine learning methods, including one- and two-class Support Vector Machines, Random Forests, and decision trees.

The study in [33] focused on developing an approach to recognize a group of features derived through currently available tools as well as generated a new group of features aimed at improving PDF maldoc identification and prolonging the useful life of current analysis and detection techniques. The importance of the produced features was assessed using a wrapper function that leveraged three key supervised learning methods as well as a feed-forward deep neural network. Subsequently, a novel classifier that significantly improved classification efficacy with shorter training times was constructed deploying features of the highest significance. With the use of huge datasets from VirusTotal [34], the findings were verified.

From top to bottom, the authors in [35] looked into PDF design and JavaScript content contained in PDFs. They developed a wide range of features for design and metadata, including the number of bytes per second, the encoding method, catchphrases, object names, and intelligible strings in JavaScript. Additionally, since subtle changes have a significant impact on AI calculations, it is challenging to develop hostile models when the attributes vary. To reduce the risk of malicious attacks while maintaining structures and data properties, they developed a classification model using discovery-type models. They created an adversarial attack in order to accept the suggested paradigm. An outline of the PDF was provided in [36], and contemporary attacks on PDF malware were carried out using reliable attack models obtained from nature. They gave an example of how to use programming skills to perform a quantitative analysis of a PDF file to look for signs of contained malware. They looked at some of the emerging AI-powered tools for detecting PDF malware which may assist computational scientific analyses and can flag questionable documents before a more thorough, more conclusive statistical analysis is published. They looked at the PDF restrictions alongside various unresolved problems, especially how their flaws might be used to potentially misdirect measured investigations. Finally, they offered advice on how to make those structures more effective in withstanding attacks and sketched a possible assessment.

Obfuscation strategies used by PDF maldoc authors were noted by the study in [37]; these techniques hinder automated evaluation and identification methods and make manual analysis more difficult. This involves exploiting PDF filters, comments, and white space to spread harmful code across numerous objects. Other strategies include gathering around strewn harmful code fragments throughout the page using a "Names" dictionary. Furthermore, hazardous substances can be concealed in odd places like document metadata or the fields (comments) of annotations. Moreover, memory spraying and the use of shellcodes to download malicious files or documents were included in the study of [37] for the classification of PDF-based attacks as JavaScript code exploits.

Because a PDF document acts identically on several devices, the authors in [38] developed a detection method based on behavioral inconsistencies on those platforms using a software engineering idea. On the other hand, a malicious document will behave differently depending on the platform. The study in [39] emphasized malware inserted into PDF files as a representative example of contemporary cyberattacks. They began by classifying the various production processes for PDF malware scientifically. They used a proven adversarial AI framework to counter PDF malware detectors that rely on learning. This strategy, for instance, made it possible to discover existing faults in learning-oriented PDF malware trackers as well as novel threats that may threaten such architectures, as well as the likelihood of protective actions.

In [40], the authors outlined an innovative approach to detect data problems of an ensemble classifier. The ensemble classifier's prediction was shown to be false when enough individual classifier votes clashed during detection.


The recommended method, ensemble classifier consensus evaluation, facilitated the findings of various sorts of system evasions without the necessity for additional external ground truth. The authors tested the suggested approach using PDFrate, a PDF malware detector, and revealed that a significant number of assumptions could be derived utilizing improved ensemble classifier concordance using the entire network's data.

The authors of [25] demonstrated how the least optimistic case behavior of a malware detector in terms of specified intensity features could be examined. Additionally, they discovered that creating classifiers with legally verified efficient features may raise the expense of avoiding unrestrained attackers by simply skipping over simple assault avoidance techniques. They put forth an alternative distance measure that relies on the tree structure of PDF and identified two groups of strong features, such as erasures and subtree inclusions.

In [32], the researchers presented Lux0R, further referred to as "Lux 0n discriminant References," a novel and adaptable approach for detecting malicious code in JavaScript. The recommended strategy hinged on describing code in JavaScript using API references, which contained elements that a JavaScript Application Programming Interface (API) can intuitively comprehend such as objects, constants, functions, attributes, methods, and keywords. To isolate suggestive risky code of a certain subgroup from API references, the proposed approach made use of machine learning, which was subsequently used to spot JavaScript malware. The important application domain that the authors focused on in this work was the detection of potentially harmful JavaScript code in PDF files. The weaknesses within existent extractors of features for PDFs were uncovered by the authors of [41] by evaluating them alongside analyzing how the framework of the fraudulent documents was set up. The researchers subsequently developed FEPDF (feature extractor-PDF), a sophisticated feature extractor that was capable of discovering characteristics that traditional extraction methods could lose and recorded accurate data concerning the PDF components. To investigate the most recent antivirus frameworks along with pattern extractors, the authors created numerous fresh harmful PDFs as samples. The results indicate that a number of existing antivirus applications were unable to identify the fresh dangerous PDFs; however, FEPDF was able to retrieve the essential components for improved dangerous PDF classification.

In [42], an integrated detection technique was suggested to track the JavaScript code's runtime behavior along with the recognition of features related to obfuscation, which included concealing certain keywords' presence with ASCII hexadecimal when several compression filters were used and the existence of any void objects. The research in [43] was based on the odd disparities between harmful and clean document construction. The authors adopted the tools that extracted the feature set utilizing the document hierarchy or structure path. A tree was constructed from the document hierarchy and, on the basis of the presence of specific paths, the harmful and clean files were identified, according to the authors.

However, to combat the existing threats caused by PDF malware as well as to mitigate the challenges posed by the evasive behavior of PDFs, we certainly require an effective classifier that works with an explainable feature set covering the wide range of behaviors of PDFs. In this research, we developed a dataset that covers the characteristics of clean and harmful PDFs along with a limited introduction to the evasive behaviors of PDFs. Moreover, we identified an explainable feature set by extracting useful features from the PDFs by adopting three well-known PDF analysis tools. Furthermore, we identified an effective machine learning classifier that leverages the newly developed feature set to detect PDF malware with improved accuracy. Finally, we provided a thorough explanation of the performance of the classifier by describing a decision tree built from one of the estimators of the classifier and extracting a few crucial decision rules for detecting PDF malware effectively. In the subsequent section, the details of the methodology used in this research will be discussed thoroughly.

III. METHODOLOGY
PDF files are among the most extensively used file types in the world. However, hackers can utilize PDF files, which are usually non-threatening, to introduce security dangers via malicious code, just as they can with PNG files, dot-com files, and Bitcoin [4]. As a result, PDF malware appears, demanding techniques for recognizing malicious from benign files. This section discusses the proposed detection system for analyzing and categorizing PDF files as benign or malicious. Fig. 2 represents the inclusive graphical architecture of the proposed approach utilized to conduct this research. In our proposed approach, initially, we accumulate 29901 raw PDF samples from [44] which are originally picked from Contagio Data Dump [45] and VirusTotal [34]. Then, the PDF samples are divided into Benign, Malicious, and Evasive categories according to their preassigned label as mentioned in [44] and [46]. Then, we choose 15958 PDF samples of the Benign, Malicious, and Evasive categories from the 29901 raw samples and develop a comprehensive dataset for our experimental study. We utilize three up-to-date PDF analysis tools viz. PDFiD [16], PDFINFO [17], and PDF-PARSER [18] to extract the effective standard feature sets F1, F2, and F3 respectively from the raw PDF samples of our experimental dataset. In addition to the standard feature sets, seven more features are derived by carefully observing the characteristics of PDF samples from the feature set F1. The standard feature sets F1, F2, and F3 are then implemented in the model selection phase, which includes a set of baseline machine learning classifiers. The model selection phase determines the best model amongst the baseline classifiers employed in this study based on their effectiveness for each feature set.


FIGURE 2. Proposed architecture for malicious PDF detection.
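For illustration, a minimal sketch of this model-selection loop is given below, assuming one CSV file per standard feature set with one row per PDF and a binary 'label' column (file names and column layout are placeholders, not part of the released pipeline), and using Scikit-Learn's cross_val_score; C5.0 and J48 are approximated here by a plain decision tree, since Scikit-Learn does not ship those specific implementations.

```python
# Illustrative sketch of the model-selection stage in Fig. 2 (assumed data layout:
# one CSV per standard feature set, each row = one PDF, 'label' = 1 for malicious).
import pandas as pd
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

feature_sets = {                          # hypothetical file names
    "F1 (PDFiD)": "f1_pdfid.csv",
    "F2 (PDFINFO)": "f2_pdfinfo.csv",
    "F3 (PDF-PARSER)": "f3_pdfparser.csv",
}

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree (stand-in for C5.0/J48)": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "KNN": KNeighborsClassifier(),
    "DNN (MLP)": MLPClassifier(hidden_layer_sizes=(100,), max_iter=100, random_state=42),
}

# Evaluate every classifier on every standard feature set with 10-fold
# cross-validation and report mean accuracy, mirroring the selection phase.
for set_name, path in feature_sets.items():
    df = pd.read_csv(path)
    X, y = df.drop(columns=["label"]), df["label"]
    for clf_name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{set_name:16s} {clf_name:40s} mean accuracy = {scores.mean():.4f}")
```

The best (feature set, classifier) pair reported by such a loop is what the remainder of the architecture carries forward into the derived-feature and final-feature-set stages described below.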

The extracted derived features are then merged with the standard feature sets F1, F2, and F3 respectively, to develop the derived feature sets F1′, F2′, and F3′ correspondingly. Later, we aim to generate feature subsets from F1′, F2′, and F3′ based on their importance and employ them in the best-performing model to determine the most effective feature subsets. Finally, we execute a union operation on the three best subsets acquired from F1′, F2′, and F3′ to construct the final feature set used for malicious PDF detection. The performance of our proposed approach is measured using various performance metrics such as precision, recall, F1-measure, and accuracy. Furthermore, we highlight the impact of the final feature set and how much it contributes to the classification activities. In addition, leveraging the strength of the best-performing model, we offer an explanation to make it more humanly understandable by extracting some important decision rules responsible for the classification activities. The details of our proposed methodology for malicious PDF detection are described below in the following subsections.

A. DATASET DEVELOPMENT
Existing datasets may not represent the entire range of harmful PDFs. Creating a new dataset enables us to include a broader range of samples, capturing adversaries' crafting approaches and strategies. The world of cybersecurity is continuously changing, and attackers are constantly devising new evasion strategies. A fresh dataset enables us to capture novel circumstances that may not have been present in previous datasets. By focusing on the aforementioned directions, we aim to create an all-inclusive dataset that includes not only hazardous and clean PDF samples but also a few elusive PDF samples that display the opposite features of their preassigned class, which may assist in developing an effective malicious PDF classifier.

1) PDF SAMPLE COLLECTION
To carry out our experiments, we gather a large corpus of raw PDF files from [44] which consists of 29901 PDFs. The original sources of the PDF files are two well-known sites, i.e., Contagio Data Dump [45] and VirusTotal [34]. Among the collected PDF files, we get 9109 benign PDFs from Contagio Data Dump and 20000 malicious PDFs from VirusTotal. From [44], we also gather 792 evasive PDF files, among which 400 are labeled as benign evasive and 392 are labeled as malicious evasive. Table 1 shows the distribution of the collected PDF files with their sources and their preassigned label.

TABLE 1. Collected PDF files for the experimental study.

2) PDF SAMPLE SELECTION
The primary aim of our study is to develop an effective malicious PDF detector by leveraging an explainable and efficient feature set, thoughtfully extracted from the PDF files. As a result, we mainly concentrate on sample selection for building our dataset based on a few constraints. Firstly, since PDF files that preserve pure malicious activities (for instance, JavaScript) are mainly employed by the attackers in creating malicious PDF documents as an attack method [33], we select these types of PDFs exhibiting such malicious activities.


Secondly, we consider the PDF files that pose the opposite behavior to the malicious ones; for example, legitimate PDF files usually do not contain JavaScript-related features that can possibly damage the user's systems, though it is very much possible that a clean PDF file can be generated using the nonmalicious JavaScript feature. However, developing a machine learning classifier solely based on both of these categories can lead to an overfitted model for malicious PDF identification. Moreover, JavaScript is not the only key characteristic that malicious PDFs exhibit; rather, there are other triggering features. For instance, OpenAction and Additional Action (AA) are some of the few features that may indicate potential malicious activity of a PDF [33]. Besides, adding too many diversities for both of these categories can potentially skew the feature weights and importance, which may lead to ineffective classification accuracy for our proposed model. Considering the aforementioned reasoning, thirdly, we thoughtfully select a few PDF files that exhibit malicious behaviors but are labeled as benign (benign evasive) and the PDF files that pose benign behavior but are labeled as malicious (malicious evasive). In this pilot experiment, we concentrate on developing an operational dataset that has a total of 15958 PDF files including 7500 benign (clean), 7666 malicious, 400 benign evasive, and 392 malicious evasive PDF files, by setting a limited scope to ensure that we get reliable findings as quickly as possible. Table 2 represents the dispersion of the selected PDF files for our experimental study.

TABLE 2. Selected PDF files for our operational dataset.

B. STANDARD FEATURE SET EXTRACTION
We adopt three tools viz. PDFiD, PDFINFO, and PDF-PARSER, which extract features from PDF files. Although the three tools serve the same objective, the results they produce are different and can be leveraged to create three different feature sets. The features extracted by the tools can be categorized into the following groups:
• Content-Related Features: Content-related features obtained from the PDF file yield clues regarding the file's textual and visual content. For instance, features like /Image, /Font, /ProcSet, etc. are a few examples of the content-related features we observe in our dataset.
• Structure-Related Features: The structural feature refers to the construction elements utilized to create a PDF document. This type of feature provides an internal relationship and exhibits the hierarchy among various elements of PDFs. We have considered various structural features, for instance, obj, endobj, %EOF, startxref, trailer, etc. in our operational dataset.
• Metadata Features: Metadata features of a PDF file provide valuable information about the file itself, including its title, author, creation date, and more. In our operational dataset, we have examined metadata features such as Filesize_kb, /ID, /CreationDate, /ModDate, pages, etc.
• Triggering Features: Triggering features refer to particular traits or components in a PDF file that can possibly cause various behaviors or actions, including harmful ones. Attackers might deploy these features to distribute malware, execute scripts, or perform other malicious acts. In our dataset, we carefully analyze triggering features such as /JavaScript, /JS, /OpenAction, /AA (Additional Action), /Launch, and so on.

1) PDFiD FEATURES
PDFiD is a Python-based tool [16] for scanning PDF documents in order to discover specific features and traits that may signal possible maliciousness. PDFiD does not run any code inside the PDF; instead, it concentrates on examining the parts and arrangement of the PDF to shed light on its characteristics. We have gone through all of our dataset's PDF files and used the PDFiD tool to extract 22 features, as shown in Fig. 3. These extracted features are considered for the standard feature set, F1. In the following, we describe the features in brief:
• PDF Header: The PDF header is required for applications and software to appropriately identify and comprehend PDF documents. The "%PDF" identification is followed by a version number in the PDF header. For instance, "%PDF-1.3" denotes that the PDF file complies with PDF standard version 1.3. This code notifies applications and PDF viewers that the document is in PDF format.
• obj: PDF documents are made up of objects such as fonts, text, images, forms, etc. The term obj refers to the opening of an object definition. This feature provides the total number of obj keywords that can be identified within the PDF structure.
• endobj: The term endobj specifies the closing of the object definition. In the case of PDFiD, this feature points out how many times the endobj keyword appears inside the PDF structure.
• stream: A stream object is employed in PDF documents to hold binary data, such as fonts, images, or other binary material, within the document. This feature provides the number of stream keywords that exist within the PDF file.
• endstream: This keyword denotes the completion of the stream's binary data portion. In the context of PDFiD, this feature indicates how many endstream keywords can be found within a PDF document.
• xref: The xref (cross reference) table assists in maintaining links between the structured objects that are stored in PDF files. PDFiD provides the number of xref tables that exist inside a PDF document.


FIGURE 3. A snapshot of the output of PDFiD scanning a PDF file.

• trailer: The trailer is the final component of the PDF file and contains crucial details about the byte offset to the beginning of the cross-reference (xref) table. In the case of PDFiD, this feature indicates how many trailers can be found within the structure of a PDF file.
• startxref: The startxref keyword designates the location where the xref table of the PDF is started. This feature yields how many times we can find the startxref keyword inside a PDF.
• /Page: This feature indicates the total number of pages of a PDF.
• /Encrypt: The feature outputs the number of /Encrypt keywords present within the PDF structure.
• ObjStm: The total number of object streams is counted with /ObjStm. The ObjStm possesses the ability to hold other objects, making it useful for hiding things.
• /JS: The number of objects that contain the /JS keyword, which reveals the objects having JavaScript code.
• /JavaScript: This feature demonstrates the number of objects containing JavaScript code, a common and prevalent obfuscation technique.
• /AA: This feature denotes the number of /AA (Additional Action) keywords observed inside a PDF document.
• /OpenAction: When a page or document is viewed, an automated action is indicated by the /OpenAction command. This feature demonstrates how many /OpenAction keywords a PDF document has inside its structure.
• /AcroForm: The feature denotes the number of /AcroForm keywords that exist within a PDF file. The Acrobat forms used in PDF files can be exploited by the attackers.
• /JBIG2Decode: This feature reveals the number of /JBIG2Decode keywords that exist within the structure of a PDF file. The feature explains whether the PDF uses JBIG2 compression or not; although it does not provide any direct indication of maliciousness, it requires further analysis.
• /RichMedia: The feature demonstrates the number of /RichMedia keywords that can be found within the PDF structure, which provides an indication of Flash files.
• /Launch: This outputs the number of /Launch keywords that exist within the PDF.
• /EmbeddedFile: This indicates the number of /EmbeddedFile keywords that can be found inside the structure of a PDF.
• /XFA: Certain PDF files contain XFAs, which are XML Form architectures that offer scripting capabilities that can be abused by attackers. This feature outputs the number of /XFA keywords that can be observed inside a PDF file.
• /Colors: This feature indicates the number of different colors utilized in the PDF structure.

2) PDFINFO FEATURES
PDFINFO is a command-line program that is part of the Poppler utility suite, commonly used for extracting metadata from PDF files. We extract 14 features from our operational dataset utilizing the PDFINFO tool, as depicted in Fig. 4. These 14 features are considered as the standard feature set F2. In the following, we describe the features of the feature set F2:
• Custom Metadata: This feature indicates the presence of user-defined custom metadata inside a PDF document. The feature provides the value as 'yes' or 'no'.


FIGURE 4. A snapshot of the output of PDFINFO scanning a PDF file.

• Metadata Stream: This feature outputs the presence of a metadata stream within a PDF file in the form of 'yes' or 'no'.
• Tagged: This feature demonstrates whether the PDF file is tagged for accessibility or not.
• UserProperties: UserProperties are extra characteristics or data that users can add to a PDF file for a variety of functions, including document administration or private annotation. This feature reveals whether the PDF contains any UserProperties or not.
• Suspects: The feature informs whether any potential flaws or errors have been spotted in the PDF document.
• Form: This feature outputs the Form types utilized in the PDF documents. We have observed XFA, AcroForm, or none as output from this feature.
• JavaScript: This feature informs whether the PDF file contains any JavaScript or not.
• Pages: We can observe the total number of pages that a PDF contains with the help of this feature.
• Encrypted: The feature demonstrates whether the PDF is encrypted or not.
• Page size: This feature exhibits the page dimensions of the PDF document. We have encountered PDFs with a variety of page dimensions, including A4, Letter, A3, and other uncommon page forms. If the shape of the page is peculiar, we have labeled it as miscellaneous, i.e., Page size_miscsize.
• Page rot: The feature provides the rotation information about the pages of the PDF document.
• File size: This feature outputs the size of the PDF file in bytes. However, for the simplicity of the experiment, we have converted the file size to kilobytes. Hence, we have denoted this feature as Filesize_kb throughout the study.
• Optimized: This feature informs whether the PDF document is optimized (such as size compression) or not.
• PDF version: The version of the PDF document can be observed using this feature.

3) PDF-PARSER FEATURES
PDF-PARSER is a command-line program and library written in Python that parses and analyzes the internal structure of PDF documents. It is not a PDF creation or editing tool, but rather one for inspecting the internal layout and content of existing PDF files. Though PDF-PARSER does not provide features in a direct manner, we have extracted 27 features from the parsed structure of the PDF, as shown in Fig. 5. These features are mainly the keywords frequently observed in the parsed structure of the PDF and are considered as the standard feature set F3. Initially, we have iterated through all the PDFs of our operational dataset to extract the parsed structures. Then, we search for specific keywords, i.e., features, from these parsed structures to create the feature set F3. In the following, we introduce these features in brief:
• /JS: Number of /JS keywords that can be found in the parsed structure of a PDF.
• /JavaScript: Number of /JavaScript keywords that can be found in the parsed structure of a PDF.
• /Size: Number of /Size keywords that can be observed in the parsed structure of a PDF. The /Size keyword indicates the total number of objects present in the PDF document.
• startxref: Number of startxref keywords that can be found in the parsed structure of a PDF.
• %EOF: Number of %EOF keywords that can be observed in the parsed structure of a PDF. The keyword is a marker that demonstrates the end of the PDF file.
• /Producer: Number of /Producer keywords that can be spotted in the parsed structure of a PDF. The keyword specifies the tool or software by which the PDF was created.
• /ProcSet: Number of /ProcSet keywords that can be noticed in the parsed structure of a PDF. The set of procedures (or processes) that should be employed while rendering a page or graphic content within a PDF document is specified by /ProcSet.


FIGURE 5. A snapshot of the output of PDF-PARSER scanning a PDF file.

Though this keyword does not directly indicate the maliciousness of a PDF, the keyword has been frequently encountered inside the parsed structures of clean PDF files.
• /ID: Number of /ID keywords that can be discovered in the parsed structure of a PDF. This keyword reveals the document ID, which is crucial for the integrity and security of the document and can indicate whether the document was tampered with malicious activity or not.
• /S: Number of /S keywords that can be spotted in the parsed structure of a PDF. The keyword indicates the subtype of various objects or tasks, such as text or link annotations.
• /CreationDate: Number of /CreationDate keywords that can be discovered in the parsed structure of a PDF.
• obj: Number of objects that can be spotted inside the parsed structure of a PDF.
• xref: Number of xref keywords that can be observed within the parsed structure of a PDF.
• <<: Number of '<<' keywords that can be noticed in the parsed structure of a PDF. In a PDF file, the '<<' signifies the start of a dictionary object.
• >>: Number of '>>' keywords that can be noticed in the parsed structure of a PDF. In a PDF file, the '>>' signifies the closing of a dictionary object.
• /Font: Number of /Font entries that can be discovered inside the parsed structure of a PDF.
• /XObject: Number of /XObject keywords that can be observed within the parsed structure of a PDF. The /XObject keyword is utilized to indicate and encapsulate external graphical material such as images, forms, and other sophisticated objects.
• /ModDate: Number of /ModDate entries that can be discovered inside the parsed structure of a PDF. The modification date and time of the PDF file are specified using the /ModDate keyword.
• /Info: Number of /Info keywords that can be spotted inside the parsed structure of a PDF. The term /Info describes the document's information dictionary.
• /XML: Number of /XML entries that can be discovered inside the parsed structure of a PDF.
• Comment: Number of comments that are noticed inside the parsed structure of a PDF.
• /Widget: Number of /Widget keywords that are found within the parsed structure of a PDF. The /Widget annotations are interactive components that are employed in PDF files, particularly PDF forms, which enable users to interact with input data.
• Referencing: Number of Referencing keywords that are noticed inside the parsed structure of a PDF.
• /FontDescriptor: Number of /FontDescriptor keywords that are discovered within the parsed structure of a PDF.
• /Image: Number of /Image keywords that can be found within the parsed structure of a PDF.
• /Rect: Number of /Rect keywords that are observed within the parsed structure of a PDF.
• /Length: Number of /Length keywords noticed within a PDF's parsed structure. The /Length keyword specifies the length or size of the content stream related to a PDF object in bytes.
• /Action: Number of /Action keywords noticed within a PDF's parsed structure.
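As a rough illustration of how keyword-count features of this kind can be assembled, the sketch below scans raw PDF bytes for a handful of the names and keywords listed above. It is only an approximation of what PDFiD and PDF-PARSER report: the real tools also normalize hex-escaped names (e.g., /J#53 for /JS), parse objects and streams properly, and distinguish names such as /Page from /Pages; the file path here is a placeholder.

```python
# Simplified keyword-count extraction over raw PDF bytes (approximation only;
# PDFiD/PDF-PARSER handle name obfuscation and object parsing far more carefully).
import re
from collections import OrderedDict

KEYWORDS = [b"obj", b"endobj", b"stream", b"endstream", b"xref", b"trailer",
            b"startxref", b"/Page", b"/Encrypt", b"/JS", b"/JavaScript", b"/AA",
            b"/OpenAction", b"/AcroForm", b"/Launch", b"/EmbeddedFile", b"/Producer"]

def count_keywords(pdf_path):
    """Return an ordered dict of raw keyword counts for a single PDF file."""
    with open(pdf_path, "rb") as fh:
        data = fh.read()
    counts = OrderedDict()
    for kw in KEYWORDS:
        if kw.startswith(b"/"):
            # plain substring match for name keywords; note this also hits
            # longer names (e.g., /Pages when counting /Page)
            pattern = re.escape(kw)
        else:
            # avoid counting 'obj' inside 'endobj', 'xref' inside 'startxref', etc.
            pattern = rb"(?<![a-zA-Z])" + re.escape(kw)
        counts[kw.decode()] = len(re.findall(pattern, data))
    return counts

if __name__ == "__main__":
    print(count_keywords("sample.pdf"))  # placeholder path
```

Repeating such a pass over every file in the operational dataset and stacking the resulting dictionaries row-wise yields a feature table in the spirit of F1 and F3, to which the derived features introduced next can be appended.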


C. DERIVED FEATURES
We have derived seven more features by carefully observing the characteristics of the PDFs from the standard feature set F1. The derived features are introduced in the following:
• Headerlength: The feature considers the length of the filename, i.e., the length of the title of the PDF.
• Headercorrupt: This is a binary feature that considers the version of the PDFs. If the version of any PDF does not start with %PDF-1.X, where X = [0, 1, ..., 7], then the feature value is set to 1 and 0 otherwise.
• Small content: This binary feature is derived through the careful observation of the number of objects in a PDF. If a PDF has 14 objects or fewer, then the feature is set to 1 and 0 otherwise. We have observed the mean value, grouped by class (malicious and benign), for the number of objects present in the PDFs. The mean value for malicious PDFs is 14.1, whereas for the benign ones the mean is 85.6. Furthermore, we have spotted 6277 malicious PDFs, which is 77.90% of the entire malicious PDF sample, that fall at or under this threshold value of 14. On the other hand, we have discovered 1709 benign PDFs, which is 21.6% of the entire benign PDF sample, that also fall under the same constraint. Moreover, we have observed 22.10% malicious PDFs that do not meet the threshold requirement. However, through thoughtful inspection, we set the threshold value as 14 for this binary feature.
• Content corrupt: This binary feature is set to 1 if the number of obj and endobj entries of the PDF are not the same and 0 otherwise.
• Stream corrupt: If the number of stream and endstream entries are not the same, then this feature is set to 1 and 0 otherwise.
• Malicecontent: This binary feature is set to 1 if at least two of the features /JS, /JavaScript, /AA (Additional Action), /Launch, and /OpenAction are found at least once within a PDF and 0 otherwise. These features pose a risk since they can be exploited to insert and execute malicious code within a PDF document.
• Hidden File: The PDFiD tool, while scanning a PDF file, indicates if there is any hidden file embedded by the adversaries within the document. This binary feature is set to 1 if there is any hidden file found within the document and 0 otherwise.
To find out the effectiveness of the derived features for malicious PDF detection, we have merged these derived features with the standard feature sets F1, F2, and F3 respectively to generate the derived feature sets F1′, F2′, and F3′ correspondingly.

D. MACHINE LEARNING MODEL SELECTION
In this pilot study, we intend to develop an effective data-driven approach based on machine learning to detect malicious PDFs. To select an effective machine learning model, at the first step, we initialize the standard feature set F that contains the F1, F2, and F3 feature sets. Then, we select M baseline machine learning classifiers including Random Forest, C5.0, SVM, J48, AdaBoost, Deep Neural Network (DNN), Gradient Boosting, and KNN for our experimental study. We iterate through each feature set in F, for each classifier in M, to evaluate each classifier on each feature set based on 10-fold cross-validation and generate the classification report. We compare the classification reports yielded for each feature set and choose the best-performing model for PDF malware detection based on the report.

E. FINAL FEATURE SET
After we generate the derived feature sets F1′, F2′, and F3′ by merging the derived features with the standard feature sets F1, F2, and F3 respectively, we concentrate on developing the final feature set. The intuition behind building the final feature set is to create an effective single feature set by observing all the derived feature sets. Thus, we measure the feature importance, rank the features for each derived feature set, and generate subsets based on the rank of the features. By leveraging the best-performing model, we implement each feature subset utilizing the model to find out the effectiveness of the subset. We iterate through the derived feature sets F1′, F2′, and F3′ respectively to create subsets by taking top features from each one of them. After we complete the iteration for subset generation and their evaluation utilizing the best-performing model, we compare the performance of the model among the subsets of each derived feature set to find the best feature subset. We take the best feature subset from each derived feature set F1′, F2′, and F3′ respectively, and perform the union operation among them to develop the final feature set. This newly developed final feature set is then utilized to conduct the final classification activities to detect PDF malware. Algorithm 1 explains the overall approach of the final feature set generation. In the following section, we discuss the experimental results obtained in this research for PDF malware detection.

IV. EXPERIMENTAL RESULTS
In this pilot experiment, we aim to 1) build an efficient and improved feature set for identifying PDF malware, 2) empirically explore various baseline machine learning classifiers to select an effective machine learning classifier that can leverage the freshly created feature set to identify PDF malware with an improved detection accuracy, 3) explain how much the features of the final feature set contribute to the classifier to detect maliciousness in PDF, and 4) extract a few crucial decision rules leveraging the power of the classifier that are easily understood and interpretable by humans to aid in the detection of potential maliciousness in PDF. To accomplish the objectives, we carry out experiments based on the following cases:

• Case I: In this instance, we mainly concentrate on finding the answers to the following questions: 1) How do the standard feature sets F1, F2, and F3 assist in identifying PDF malware? and 2) What is the suitable machine learning model for PDF malware detection?
• Case II: In this scenario, our key focus is to uncover the findings of the following queries: 1) What is the impact of the derived feature sets in identifying PDF malware? and 2) How do the derived feature sets F1′, F2′, and F3′ contribute to the detection of PDF malware?
• Case III: In this case, we investigate the answers to the following questions: 1) What features are selected for the final feature set? and 2) Does the final feature set boost the performance of the classifier?
• Case IV: In this particular circumstance, we look into finding the answers to the following queries: 1) How does the combined feature set (i.e., F1 + F2 + F3 + derived features, which can also be represented by F1′ ∪ F2′ ∪ F3′) help in the detection of malicious PDF? 2) Will the best feature subset generated from the combined feature set, by taking into account the feature importance and the approach described in Case III, differ from the final feature set acquired in Case III? 3) Will the best feature subset produced in this scenario have a positive or negative impact on classification performance? and lastly 4) What is the classification performance when no derived features are used, such as only with F1 + F2 + F3?
• Case V: In this instance, we focus on explaining, i.e., how does the freshly developed final feature set aid the classifier in detecting PDF malware? Furthermore, we present an analysis of the distribution of the characteristics of the final feature set in the operational dataset to identify a few prospective directions that may effectively aid in identifying PDF malware.
In addition, we utilize the strength of the classifier to discover a few key decision rules that humans can understand easily and apply to identify potentially dangerous PDF content.

Algorithm 1 Steps for Final Feature Set Generation
Input: Derived feature sets, F′ = {F1′, F2′, F3′}
Output: Final subset containing the union of best subsets
1: for f in F′ do
2:   Find the importance of features in f
3:   Sort the features in f based on importance
4:   Generate subset S = {S1, S2, S3, ...} by taking top features from f
5:   Initialize best_subset as an empty set
6:   for s in S do
7:     Apply s on the best ML model selected for PDF malware detection
8:     Generate classification report
9:     if subset performance is better than best_subset performance then
10:      Set best_subset to s
11:    end if
12:  end for
13:  Perform a union operation to append best_subset to Final_Subset
14: end for

A. EVALUATION METRICS
We employ the abbreviations for the evaluation metrics listed below to analyze the classification report:
• Acc: The term accuracy is abbreviated as Acc, and it can be assessed using the following formula:
Accuracy = (TP + TN) / (TP + FN + TN + FP)
where TP means True Positive, FP means False Positive, TN means True Negative, and FN stands for False Negative.
• Pr: The term precision is abbreviated as Pr and can be measured by
Precision = TP / (TP + FP)
• Rec: The abbreviation Rec is used in place of the term Recall, which can be quantified by
Recall = TP / (TP + FN)
• F1: The term F1-Score is denoted as F1, which can be calculated by using
F1 = 2 * (Pr * Rec) / (Pr + Rec)

B. CASE I
In this case, we look at the impact of the standard feature sets F1, F2, and F3, as well as baseline machine learning classifiers, to determine the top-performing model for detecting PDF malware. Table 3 demonstrates the performance of the various baseline machine learning classifiers along with a deep neural network (DNN) for PDF malware detection utilizing the standard feature sets F1, F2, and F3 based on 10-fold cross-validation. Conspicuously, we can observe that the Random Forest classifier yields the best accuracy for all the standard feature sets compared to the baseline classifiers while identifying PDF malware. We utilized Scikit-Learn, a well-known open-source machine-learning library for Python, to implement the baseline classifiers. We constructed the Random Forest classifier with 100 estimators and random_state = 42 to handle the randomness. On the other hand, we built the C5.0, SVM, J48, AdaBoost, and KNN classifiers with their default hyperparameters as per the Scikit-Learn library.

Furthermore, we developed the DNN model, which is an MLPClassifier with a hidden layer of 100 units and random_state = 42. We executed the DNN model for 100 epochs. And finally, the Gradient Boosting classifier (GBC) was introduced with 100 estimators and random_state = 42.


set F1 , we clearly witness that the Random Forest classifier


stands out as the top-performing model for identifying PDF
malware by acquiring an accuracy of 96.82% compared to
the other models used in this work. Upon examining the
other classifiers’ performance, we find that the C5.0 clas-
sifier achieved the second-best accuracy for malicious
PDF identification, with 96.59%. However, we find SVM
classifier yielded comparatively less effective performance
for identifying the malicious PDFs. On the other hand, the
J48, AdaBoost, GBC, KNN, and DNN models provided an
accuracy of 96.57%, 95.92%, 96.49%, 95.77%, and 96.20%
respectively for PDF malware detection. Similarly, we notice FIGURE 8. ROC curve comparison of random forest model with various
that the Random Forest model outperforms the baseline classifiers adopted in this study on the standard feature set F3 .

classifiers by showing an accuracy of 96.53% and 97.19%


for detecting PDF malware utilizing the standard feature sets Table 4 lists the feature importance of the standard feature
F2 and F3 respectively. Fig. 6, 7, and 8 explicitly exhibit the set F1 , F2 , and F3 respectively while implementing the
ROC curve comparison of the classifiers implemented in this Random Forest model for PDF malware detection. From
study on the standard feature sets F1 , F2 , and F3 respectively. Table 4, we identify the key features of the Random Forest
We can identify that the Random Forest classifier provided model which aid in achieving a better performance of the
the best area under the curve (AUC) score compared to model compared to the other baseline classifiers. The features
the others for PDF malware detection in each of these /JS, startxref, and JavaScript are the most important features
standard feature sets. Among the three standard feature sets, of F1 whereas the features Filesize_kb, Metadata Stream,
we encounter that the feature set F3 turns out to be the best and Optimized are the most important features from F2 .
for obtaining better performance of the model. To verify Besides, we identify that the /JS, /JavaScript, and /Producer
the intuition of the Random Forest model’s performance, are the top three important features from the feature set
we investigated the feature significance of the standard F3 . Leveraging the effectiveness of the Random Forest
feature sets using the model directly. classifier in high-dimensional feature space as well as the
decision-making capability based on the ensemble learning
method and certainly observing the aforementioned empirical
analysis utilizing the standard feature sets, we consider the
Random Forest model as the best-performing model to detect
PDF malware. Thus, for further case studies, we take only the
Random Forest model to execute our desired experiments.

C. CASE II
In this case, we discover the impact of the derived feature
sets on the classifier’s performance as well as how much the
derived features contribute to identifying PDF malware.
Table 5 highlights the performance of the Random Forest
classifier utilizing both the standard and derived feature sets
based on 10-fold cross-validation. The findings explicitly
demonstrate that the derived feature set noticeably improved
the effectiveness of the classifier compared to the standard
feature sets. We observe a nearly 2% increase in accuracy for
the classifier when utilizing the derived feature set F1′ instead
of the standard feature set F1 . Similarly, the other derived
feature sets F2′ and F3′ maintain the same consistency in accuracy improvement of the classifier as F1′.
However, for the derived feature set F3′ , we obtain the best
classification report by attaining an accuracy of 98.90% of
the classifier for the detection of PDF malware.
To further assess the significance of the derived features,
we estimate the feature importance of the derived feature
sets F1′, F2′, and F3′ leveraging the power of the classifier.

TABLE 3. An investigation of the Accuracy, Precision, Recall, and F1-Score of various machine learning methods for standard feature sets utilizing tools (PDFiD, PDFINFO, and PDF-PARSER) based on 10-fold cross-validation.

Table 6 shows the feature importance of the derived feature sets and explicitly exhibits the significance of the derived features and how much they contribute during the classification. The results reveal that the derived feature Headerlength contributes most for all the derived feature sets, i.e., the feature Headerlength contributes 28.19%, 34.15%, and 30.85% when F1′, F2′, and F3′ are utilized respectively for the classification activities. Likewise, another important derived feature, Malicecontent, turns out to be in the top three in terms of its significance among all the features within the derived feature sets F1′ and F2′, exhibiting 12.6% and 17.10% contributions respectively for the classification purpose. However, we observe the Malicecontent feature as the second most important feature, yielding 9.7% significance when the classifier utilizes F3′ for the identification of PDF malware. Apart from the Headerlength and Malicecontent derived features, we encounter seldom contributions from the other derived features, though we find the small content feature assisting the classifier a little among the rest of the derived features.

D. CASE III
In this instance, we identify the features from the derived feature sets F1′, F2′, and F3′ that are selected for the final feature set through careful observations. Besides, we investigate the effect of introducing the final feature set on the classifier's performance. Furthermore, we estimate the significance of the final feature set in the identification of PDF malware.

To identify the best feature subsets from the derived feature sets, we follow the steps mentioned in Algorithm 1. We use the findings highlighted in Table 6, where the feature importance of each derived feature set is estimated and then sorted according to importance. We consider the features with at least 1% of the feature importance score to generate subsets from the derived feature sets F1′, F2′, and F3′. Thus, we find the top 15 features from F1′, the top 10 features from F2′, and the top 18 features from F3′ that satisfy the aforementioned condition, and these features are selected for the generation of feature subsets initially. We generate the subsets starting from the first feature and gradually increase the features sequentially up to the last feature considered for feature subset generation from each derived feature set. Therefore, we build 15 subsets from F1′, 10 subsets from F2′, and 18 subsets from F3′ respectively. Further, we follow the steps as mentioned in Algorithm 1 to estimate the effectiveness of each feature subset and to select the best feature subset from each derived feature set for final feature set generation.
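The following short Python sketch illustrates this kind of incremental top-k subset search (features with at least 1% importance, evaluated by 10-fold cross-validated mean accuracy of a Random Forest). It is a simplified stand-in for Algorithm 1, not the exact procedure or data used in this study.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for one derived feature set (e.g. F1'); real columns
# would include Headerlength, Malicecontent, /JS, startxref, and so on.
X, y = make_classification(n_samples=1000, n_features=25, n_informative=10,
                           random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
ranked = pd.Series(rf.feature_importances_, index=X.columns)
ranked = ranked[ranked >= 0.01].sort_values(ascending=False)  # keep >= 1%

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
best_subset, best_score = None, -1.0
for k in range(1, len(ranked) + 1):          # top-1, top-2, ..., top-k subsets
    subset = list(ranked.index[:k])
    score = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X[subset], y, cv=cv, scoring="accuracy").mean()
    if score > best_score:
        best_subset, best_score = subset, score

print("best subset:", best_subset, "mean accuracy:", round(best_score, 4))
```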

TABLE 4. Feature importance of standard feature sets F1 , F2 , and F3 .

TABLE 5. Improvement of random forest classifier's performance using derived feature sets compared to the standard feature sets for detecting PDF malware based on 10-fold cross-validation.

We highlight the mean accuracy obtained utilizing the aforementioned approach for each feature subset of F1′, F2′, and F3′ in Fig. 9, Fig. 10, and Fig. 11 respectively. According to the depiction in Fig. 9, the classifier has a maximum mean accuracy of 98.69% when implementing the subset consisting of the 11 top features from F1′, and there is a small variation in the mean accuracy for the other subsets of F1′. Likewise, in Fig. 10, the classifier yields a maximum mean accuracy of 97.91% when applying the subset consisting of the 8 top features from F2′, and there is a considerable variation in the mean accuracy for the other subsets of F2′. However, we notice in Fig. 11 that the classifier achieves the highest mean accuracy when utilizing all of the top 18 features considered for subset generation from F3′. Thus, the subset comprising the top 11 features of F1′ is identified as the best feature subset from F1′, the top 8 features of F2′ is identified as the best feature subset from F2′, and the top 18 features of F3′ is identified as the best feature subset from F3′. To accommodate the commonness as well as the uncommonness among the newly identified best feature subsets, we perform a union operation to generate the final feature set. Table 7 represents the list of the identified features that are finally considered for the final feature set. We discover three derived features, Headerlength, Malicecontent, and small content, in the final feature set. Since the JavaScript feature is found in all three derived feature sets F1′, F2′, and F3′, we only consider this feature once (from F1′) in the list.

The features /JS, startxref, and xref are noticed both in F1′ and F3′, so we take
them only once (from F1′ ) in the list of the final feature set.
We observe the features obj, endobj, stream, /OpenAction,
and /XFA only from F1′ in the final feature set. Besides the
features Filesize_kb, MetadataStream, Optimized, and Pages
are encountered only from F2′ , and the rest of the features
listed in Table 7 are observed only from F3′.
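The union step itself is straightforward; the sketch below shows one way to merge the three best subsets while dropping duplicates. The feature names follow Table 7, but the exact membership of each subset shown here is illustrative rather than the study's actual lists.

```python
# Hypothetical best subsets from F1', F2', and F3' (illustrative membership only).
best_f1 = ["Headerlength", "Malicecontent", "/JS", "JavaScript", "startxref",
           "xref", "obj", "endobj", "stream", "/OpenAction", "/XFA"]
best_f2 = ["Headerlength", "Malicecontent", "JavaScript", "Filesize_kb",
           "Metadata Stream", "Optimized", "Pages", "small content"]
best_f3 = ["Headerlength", "Malicecontent", "JavaScript", "/JS", "startxref",
           "xref", "/Producer", "/ID", "/CreationDate", "/Info"]

# Union of the three best subsets, keeping only the first occurrence of each
# feature so that duplicates such as JavaScript are counted once.
final_feature_set = list(dict.fromkeys(best_f1 + best_f2 + best_f3))
print(len(final_feature_set), "features:", final_feature_set)
```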
We investigate the impact of the final feature set on the Random Forest classifier based on 10-fold cross-validation to detect PDF malware. Table 8 shows the findings of the classifier for several types of feature sets used in this research. We notice an impressive increase in the accuracy of the classifier due to the utilization of the final feature set compared to the standard and derived feature sets. The maximum accuracy improvement for the classifier when employing the final feature set is 2.71% compared to the standard feature set F2. On the contrary, we find that the minimum accuracy improvement of the classifier when executing the final feature set is 0.34% compared to the derived feature set F3′. However, the classifier provides a noticeable performance boost in the case of PDF malware detection due to the introduction of the freshly developed final feature set.

FIGURE 9. Mean accuracy of random forest classifier vs top feature subset of the derived feature set F1′.

FIGURE 10. Mean accuracy of random forest classifier vs top feature subset of the derived feature set F2′.

FIGURE 11. Mean accuracy of random forest classifier vs top feature subset of the derived feature set F3′.

Fig. 12 illustrates the accuracy curve of the Random Forest classifier implemented using the final feature set based on 10-fold cross-validation. We observe that the model attains 99.56% accuracy during the sixth fold, whereas the model yields its minimum accuracy of 99.05% during the ninth fold of the 10-fold cross-validation. The log loss curve of the Random Forest model during the various folds of the 10-fold cross-validation is depicted in Fig. 13. We observe the maximum loss during the third fold, whereas the minimum loss occurs during the first fold of the entire 10-fold cross-validation. Fig. 14 represents the Receiver Operating Characteristic (ROC) curve of the Random Forest model for the various folds of the 10-fold cross-validation on the final feature set. We notice an area under the curve of 1.00 for the Random Forest model throughout the entire 10-fold cross-validation.

FIGURE 12. Accuracy curve of random forest model on final feature set based on 10-fold cross-validation.

FIGURE 13. Loss curve of random forest model on final feature set based on 10-fold cross-validation.
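A minimal sketch of how such per-fold accuracy, log loss, and ROC-AUC values can be produced with scikit-learn is given below; the synthetic data is a placeholder for the final feature set, and this is not the exact evaluation script used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the final feature set and labels.
X, y = make_classification(n_samples=2000, n_features=26, random_state=42)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr, te) in enumerate(cv.split(X, y), start=1):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X[tr], y[tr])
    proba = rf.predict_proba(X[te])[:, 1]      # probability of the malicious class
    print(f"fold {fold:2d}  "
          f"accuracy={accuracy_score(y[te], proba > 0.5):.4f}  "
          f"log_loss={log_loss(y[te], proba):.4f}  "
          f"auc={roc_auc_score(y[te], proba):.4f}")
```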

TABLE 6. Feature importance of derived feature sets F1′ , F2′ , and F3′ .

E. CASE IV
In this circumstance, we present the findings of implementing the combined feature set (i.e. F1 + F2 + F3 + derived features) for detecting PDF malware. We identify the best feature subset from the combined feature set by considering the approach as stated in CASE III and discover the difference with the final feature set. Moreover, we find the impact of the best subset obtained in this case on the classification performance. Also, we discuss the potency of the classifier when no derived features are used for malicious PDF detection.

TABLE 7. List of identified features of the final feature set.

TABLE 9. Impact of combined feature set as well as the feature set with no derived features (i.e. F1 + F2 + F3) on the random forest classifier's performance based on 10-fold cross-validation for PDF malware detection.

Table 9 explicitly depicts the performance of the Random Forest classifier when utilizing the combined feature set with derived features as well as the merged feature set with no derived features and presents a comparison of the classifier's efficacy with the final feature set. We observe that the classifier acquired 99.19% accuracy when utilizing the combined feature set with derived features (i.e. F1 + F2 + F3 + derived features) for identifying malicious PDFs. On the other hand, the classifier yielded an accuracy of 97.28% for the merged feature set (i.e. F1 + F2 + F3) with no derived features. Thus, we can notice the impact of the derived features in enhancing the performance of the classifier by at least 1.91% for detecting PDF malware. However, we discovered comparatively better efficacy of the classifier for the utilization of the final feature set to identify malicious PDFs. The reason behind this is that the combined feature set consists of a comparatively larger number of features than the final feature set. Because of the higher number of features, in high-dimensional spaces the classifier sometimes may find patterns in noise rather than genuine relationships and provide somewhat less effective performance. Furthermore, the elimination of unnecessary features can improve the model's ability to generalize and classify malicious PDFs effectively. Thus, the final feature set, with fewer features selected on the basis of feature importance, produced better accuracy than the combined feature set.

TABLE 8. Impact of final feature set on the random forest classifier's performance based on 10-fold cross-validation for PDF malware detection.

Table 10 represents the feature importance of the top features of the combined feature set obtained utilizing the strength of the classifier. To generate the subset from the table, initially we consider the features having at least a 1% feature importance score (as mentioned in CASE III). Thus, to construct the best subset of the combined feature set, we evaluate only the top 10 features from Table 10. Then, we develop 10 subsets from these features by following the approach as described in CASE III. The mean accuracy obtained from these subsets adopting the classifier is illustrated in Fig. 15. We notice that the subset containing all the top 10 features produced the best mean accuracy of 98.53%. Therefore, we identify the subset having the features (Malicecontent, /JavaScript, Filesize_kb, /Producer, Headerlength, /S, /ProcSet, /ID, startxref, /Info) as the best subset of the combined feature set. Conspicuously, we can strongly differentiate between the final feature set and the best subset obtained from the combined feature set. However, we note that the best subset achieved in this case does not improve the classifier's accuracy when compared to the classifier's performance for the final feature set.

TABLE 10. Feature importance of top features of combined feature set.

FIGURE 14. ROC curve of random forest model on final feature set based on 10-fold cross-validation.

Furthermore, to validate the effectiveness of the final feature set, we utilize the Correlation-based Feature Selection (CFS) technique to generate the best feature subset from the combined feature set. We implement the CfsSubsetEval feature selection method (with the Best First Search approach)
using a popular machine learning software, Weka 3 [47].
We find the following features in the best subset as output
from the method: Headerlength, contentcorrupt, /Encrypt,
Malicecontent, /Colors, Metadata Stream, Optimized, Page
size:_A4, Page size:_miscsize, /Size, and /Action. Similarly,
we also implement the ReliefFAttributeEval feature selection
method (with Ranker approach) to produce the best subset
from the combined feature set using Weka 3. We consider
the features having at least a 1% merit score and then
adopt the approach as mentioned in CASE III to develop
and evaluate the feature subset. Finally, we identify the best
subset from this approach having the following features:
Headerlength, Optimized, Malicecontent, Metadata Stream,
Tagged, /EmbeddedFile, Custom Metadata, Form:_none,
/FontDescriptor, /XFA, /Font, small content, /Producer,
/AcroForm, /ModDate, %EOF, Form:_XFA, /XML, /Action,
/CreationDate, and xref. To further evaluate the potency of
the final feature set, we employ these feature sets using the
classifier’s strength and compare the results. We find that for
the subset of CfsSubsetEval, the classifier yields an accuracy
of 97.59% whereas for the subset of ReliefFAttributeEval, the
classifier provides 98.75% accuracy. Notably, we identify that
the classifier produces the highest accuracy when using the
final feature set to detect malicious PDFs.
Overall, we find that from CASE I the classifier delivers the highest accuracy of 97.19% on the standard feature set F3; from CASE II the classifier yields the highest accuracy of 98.90% on the derived feature set F3′; from CASE III the classifier produces the best accuracy of 99.24% utilizing the final feature set; and from CASE IV the classifier outputs the highest accuracy of 99.19% on the combined feature set with derived features. Thus, conspicuously, we can identify that the final feature set assists the classifier in delivering the highest efficacy for detecting PDF malware among all the feature sets.

FIGURE 15. Mean accuracy of random forest classifier vs top feature subset of the combined feature set.

F. CASE V
In this case, we provide an explanation of how the freshly created final feature set contributes to the classifier for identifying maliciousness in PDF. To analyze the significance of the final feature set, we estimate the importance of the top features from this feature set utilizing the Random Forest classifier, which is illustrated in Fig. 16. This illustration uncovers the important features and how much they contribute to the classification activities. The illustration reveals that the derived features Headerlength and Malicecontent contribute largely to the identification of PDF malware. On the other hand, the /JS and /JavaScript features are also proven to be very crucial for malicious PDF detection.

We observe all the features from the newly developed final feature set to explore the traits of both categories of PDFs toward these features in our operational dataset. The average value of the derived feature Headerlength is 16.50 with a standard deviation of 12.45 for the benign PDFs, whereas for the malicious ones the average value is 42.48 with a standard deviation of 7.28, as depicted in Fig. 17. The illustration also reveals that the title length of 75.83% of the benign PDFs is under or equal to the mean value of the benign ones, while 89.16% of malicious PDFs satisfy the condition of their title length being less than or equal to their mean value. However, this explains the fact that in our operational dataset, the average title length of the benign PDFs is much smaller than that of the malicious ones. This finding provides a potential indication of identifying PDF malware by just looking at the length of the title of the PDF, though this feature alone does not necessarily point to maliciousness within a PDF, because in a real-world scenario a clean PDF often may have a large title length.
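Per-class statistics of this kind can be computed with a few lines of pandas; the sketch below uses a tiny toy table in place of the operational dataset, so the column names and values are placeholders rather than the study's real data.

```python
import pandas as pd

# Toy stand-in for the operational dataset; only Headerlength and the class
# label are mimicked here (0 = benign, 1 = malicious).
df = pd.DataFrame({
    "Headerlength": [12, 18, 9, 30, 45, 41, 38, 50, 14, 44],
    "label":        [0,  0,  0, 0,  1,  1,  1,  1,  0,  1],
})

for label, group in df.groupby("label"):
    mean, std = group["Headerlength"].mean(), group["Headerlength"].std()
    share = (group["Headerlength"] <= mean).mean() * 100
    print(f"class {label}: mean={mean:.2f}, std={std:.2f}, "
          f"{share:.2f}% of samples at or below the class mean")
```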
The distribution of another derived feature, Malicecontent, which is constructed through inspecting the triggering features across the malicious and benign PDFs, is illustrated in Fig. 18.

FIGURE 16. A bar plot to display the importance of the top features of the final feature set.

FIGURE 17. Characteristics of Headerlength feature for both benign and malicious PDFs in the operational dataset.

During our pilot study, we noticed 7246 malicious PDFs identified by this feature, whereas 651 benign PDFs also fell under the same condition. This demonstrates that malicious PDFs mostly contain triggering features compared to the clean ones, which also points out a potential direction for identifying PDF malware in real-world cases. The characteristics of the /JavaScript feature, a popular choice by cyber attackers to build a PDF maldoc, are plotted in Fig. 19 through the proper inspection of our operational dataset. The finding explains that close to 92% of the clean PDFs hardly have JavaScript features, while nearly one-third of the malicious PDFs from our dataset contain this feature. The presence of this feature within a PDF exhibits a strong possibility of malicious activity. Similar to the /JavaScript feature, we discover very little presence of the /JS feature among the clean files, but in the case of the malicious ones, we observe frequent presence of this keyword.

In Fig. 20, we portray the traits of the Filesize_kb feature, which explicitly reveals that the average size of the clean PDFs is much larger than that of the harmful ones. We discover a high standard deviation for the file size of the benign PDFs, as well as the fact that only 68.77% of clean PDFs have a file size less than or equal to their mean size. We also detect a substantial standard deviation for the malicious PDFs, with nearly 10% of the malicious PDFs' file size falling outside of their mean bounds. The mean value of the /Size keyword is 2.27 for benign PDFs and 1.00 for malicious PDFs, emphasizing that attackers try to embed their intended payload inside PDFs rather than focusing on the content of the PDFs. The metadata features /Producer, /ID, /CreationDate, and /Info are commonly observed in clean PDFs but are rarely observed in the hazardous ones in our operational dataset. Likewise,

the feature MetadataStream is spotted more often in the clean ones compared to the malicious ones.
This finding leads to the possibility that hazardous PDFs
include fewer metadata features than clean PDFs. Fig.21
depicts the distribution of the obj characteristic across both
PDF categories. The average number of objects for the clean
PDFs is 85.61, with a significantly high standard deviation of
169.81, whereas the hazardous PDFs have a comparatively
small mean of 14.11, with a standard deviation of 21.89. This
underscores the fact that harmful files typically contain fewer
objects since the attacker’s objective is to create malicious
material with as few objects as feasible in order to execute
their attack as quickly as possible. We notice a similar pattern
with the endobj feature, as every object declaration should be
followed by an endobj, ideally. However, we inspect a small
variation in the mean size for the endobj feature compared
to the obj feature for malicious PDFs, with a mean value
of 16.50. For the stream feature, we observe that malicious
files occupy a limited number of streams compared to the
clean files. Attackers exploit startxref and xref as well to
avoid detection. For a PDF document, a reader program will
render and show it as follows: The EOF (End Of File) mark
at the bottom of the document serves as the starting point
for reading. It is going to be the startxref preceded by the
offset of the cross-reference table immediately above. The
offset of the root dictionary, which serves as the starting point
of the hierarchical structure (PDF file), by which all objects
can be retrieved, is contained in the cross-reference table.
If either the xref or the startxref are missing, stringent readers
and parsers will reject the file as malformed. A versatile reader, on the other hand, will be capable of navigating to the root dictionary and rendering the file, much like contemporary readers. The %EOF feature can also be manipulated by the
attackers to perform malicious activities.
The features /XFA and /OpenAction are observed to be
more prominent in malicious PDFs than in clean files in
our experimental dataset. Because the clean PDFs contain
more objects and are larger in size than the hazardous ones,
we discover that the features /Font, Referencing, XML, /Rect,
/S (subtype of objects or tasks), /ProcSet (set of procedures)
are more prevalent in the clean files than in the malformed
ones. Furthermore, the average number of pages in clean
files is 5.79, whereas it is 1.44 in malicious files. This
demonstrates that harmful files in our operational dataset
are rather short, generally consisting of one or two pages
with limited information (often a blank page). This discovery
points to a possible path for spotting questionable PDF files.
We observe little significance of the feature Optimized and the derived feature small content throughout our entire operational dataset for detecting malicious PDFs.

FIGURE 18. Distribution of Malicecontent feature for both benign and malicious PDFs in the operational dataset.

FIGURE 19. Characteristics of /JavaScript feature for both benign and malicious PDFs in the operational dataset.

FIGURE 20. Characteristics of Filesize_kb feature for both benign and malicious PDFs in the operational dataset.

G. RULE DISCOVERY AND HUMAN INTERPRETATION
We explore the interpretation of the Random Forest classifier, i.e., how the classifier predicts maliciousness in PDF with the help of the final feature set, by constructing a decision tree from one of the estimators of the classifier as well as extracting a few important decision rules from the implementation of the classifier to detect PDF malware. Moreover, we provide interpretations of the decision rules so that they can be easily comprehended. To decode the classifier's performance, in Fig. 22, we illustrate one of the decision trees from the 100 estimators of our Random Forest classifier. The generated decision tree provides an explanation of the performance of the classifier by representing the conditions of the various features from the final feature set in its nodes

for detecting malicious or benign PDFs. As we observe the tree, we notice that each node of the tree specifies a feature with a certain threshold condition which is used to split the samples and also mentions the percentage of samples that reached the node. Moreover, each node also provides the proportionate class distribution of the samples that reached the node and indicates the final class label that has the majority vote. The feature in the root node indicates the most important feature of the tree. If the condition in the root node is true, then the control transfers to the left child of the root node, or to the right child otherwise. This process continues for each node until we reach a leaf node, which indicates a decision rule specifying a certain class label.

FIGURE 21. Distribution of Obj feature for both benign and malicious PDFs in the operational dataset.

In Fig. 22, we find that the Malicecontent feature is in the root node of the tree, specifying a threshold condition of Malicecontent <= 0.5 which is used to split the PDF samples. The condition implies that the initial decision point in the tree is based on the Malicecontent feature and evaluates whether or not its value is less than or equal to 0.5. We observe that the root node deals with all the samples considered for the tree. The values in square brackets show the proportionate distribution of classes at this node. The first value indicates 49% benign occurrences, whereas the second value specifies a 51% presence of the malicious class among the samples that reached the root node. We find the majority vote for the malicious class at this node. The colors show each node's majority class, with a rust-colored box representing a benign majority and a blue box representing a malicious majority. The colors become darker as the node gets closer to becoming completely benign or malicious. Similarly, the colors become lighter if the node contains a closer distribution of the samples among themselves.

As we can identify, one of the decision rules in Fig. 22 indicates If (Malicecontent <= 0.5 and /ID > 0.5 and Headerlength <= 39.5) then Class : Benign. This reveals that if any PDF sample from our operational dataset does not have the Malicecontent feature but contains the metadata feature /ID and its title length is less than or equal to 39.5, then the sample belongs to the benign class. Fig. 23 illustrates a decision plot of the Random Forest classifier for one of the instances that satisfies the above decision rule and belongs to the benign class. We adopted the SHAP library to construct the decision plot, which renders the decision-making process easier to understand by highlighting how each feature affects the output of the classifier. The plot highlights the features that push the model toward classifying the benign or malicious class for a particular instance. The blue region (left side of the vertical line) of the plot represents the features that push the classifier's prediction toward the benign class, whereas the red region (right side of the vertical line) highlights the features that push the classifier's prediction toward the malicious class. From Fig. 23, notably, we can observe that the Malicecontent, Headerlength, /ID, etc. features push the classifier towards the benign class prediction.
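A minimal sketch of producing such a SHAP decision plot for a Random Forest is given below. It is not the exact plotting code used in this work; the synthetic data and column names are placeholders, and the handling of shap_values reflects the fact that different SHAP versions return either a per-class list or a single three-dimensional array.

```python
import shap
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the final feature set; real columns would include
# Malicecontent, Headerlength, /ID, /JS, and so on.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X)

# Older SHAP versions return one array per class; pick the malicious class (1).
if isinstance(sv, list):
    sv_mal, base = sv[1], explainer.expected_value[1]
else:  # newer versions: array of shape (n_samples, n_features, n_classes)
    sv_mal, base = sv[:, :, 1], explainer.expected_value[1]

i = 0  # index of the instance to explain
shap.decision_plot(base, sv_mal[i], X.iloc[i])
```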
Similarly, we discover another decision rule that specifies If (Malicecontent > 0.5 and stream <= 24.5 and /XFA > 0.5 and Headerlength <= 52) then Class : Malicious. This means that if a PDF sample from our dataset contains the Malicecontent feature with a number of streams less than or equal to 24.5, as well as the XFA form, and its title length is less than or equal to 52, it belongs to the malicious class. Fig. 24 visualizes the decision plot for one of the instances of the above decision rule, where we can notice that the Headerlength, /XFA, Malicecontent, stream, etc. features push the classifier towards the malicious class prediction.

We discovered a total of 230 decision rules from the illustrated decision tree of the Random Forest classifier that explain the predictions for detecting PDF malware. Since our Random Forest classifier has 100 estimators, to further investigate the decision rules, we generate all the decision trees of the classifier and derive all the potential decision rules from the trees. Fig. 25 depicts the number of decision rules discovered from each of the decision trees originating from the Random Forest classifier. We identify a total of 23183 decision rules from the 100 estimators of the classifier. The findings also demonstrate that the maximum number of decision rules (i.e. 333) is obtained from tree_index = 53, whereas the minimum number of decision rules (i.e. 162) is discovered from tree_index = 07. We observe a mean of 231.83 with a standard deviation of 34.11 for the number of decision rules extracted from the decision trees of the classifier. This implies that the average number of decision rules used by a decision tree is about 231, which aids the Random Forest classifier's predictions.
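One way to count and inspect such rules is sketched below: each leaf of a decision tree corresponds to one rule (the conjunction of conditions along the root-to-leaf path), so counting leaves across all estimators gives the per-tree rule counts. The synthetic data is a placeholder and this is not the study's exact extraction script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in for the final feature set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
feature_names = [f"f{i}" for i in range(X.shape[1])]
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Number of decision rules per tree = number of leaves per tree.
rules_per_tree = np.array([est.tree_.n_leaves for est in rf.estimators_])
print("total rules:", rules_per_tree.sum(),
      "mean:", round(rules_per_tree.mean(), 2),
      "std:", round(rules_per_tree.std(), 2),
      "tree with most rules:", rules_per_tree.argmax(),
      "tree with fewest rules:", rules_per_tree.argmin())

# Text dump of the first estimator; IF/THEN rules like those quoted above can
# be read off by following each root-to-leaf path.
print(export_text(rf.estimators_[0], feature_names=feature_names, max_depth=3))
```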
Table 11 and Table 12 present a few important decision rules for detecting clean and malicious PDF files respectively, including the conditions of the rule, the total number of samples that come within the rules, as well as the right predictions and the rule's confidence, which indicates how frequently the rules are found to be true. Both of these tables provide a comprehensive explanation that can easily be interpreted by humans and aid them in identifying clean

FIGURE 22. A decision tree from one of the estimators of the Random Forest classifier used to detect PDF malware utilizing the final feature set.

FIGURE 23. A sample decision plot of an instance of Benign Class using SHAP values.

FIGURE 24. A sample decision plot of an instance of Malicious Class using SHAP values.

and harmful PDFs. As we look into Table 11, we find that the
first decision rule signifies that if the PDF sample does not On the contrary, we find comparatively less effective rules at
contain any /JavaScript and its title length is less than or equal to
39.5 then the PDF sample falls into the benign category. This rules yield a confidence level of around 90% to 98% covering
rule accurately identifies 7154 PDFs from our operational a small number of samples to clearly identify them as clean
dataset, suggesting that the rule has 100% confidence. Similar files.
to the first decision rule, all the other rules of the table can be Looking at Table 12, we see that in the case of harmful
explained and interpreted by humans to clearly detect benign PDF detection, a far larger number of requirements must
PDFs. Moreover, we find several strong rules (such as Rule be verified than in the case of benign PDF detection.
ID 2 to 8) that yield 99% to 100% confidence as well as Despite the large constraints, we find a few strong rules
cover a wide range of samples for identifying benign PDFs. (such as Rule ID 1 to 5) that offer nearly 100% confidence

13854 VOLUME 12, 2024


G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

in recognizing thousands of samples from our operational PeePDF PDF analysis tools to extract the potential features
dataset as malicious. On the contrary, we identify somewhat that were critical for the classification task. However, their
less effective decision rules that only apply to a very tiny suggested strategy provided a maximum accuracy of 97.4%
number of samples from our dataset as we carefully examine in completing the task. Besides, the authors in [4] proposed
the rules from 6 to 10 in the table. Similar to the explanation an approach called O-DT (Optimizable Decision Tree) for
indicated in Table 11, the decision rules of Table 12 can PDF malware detection. The authors utilized a benchmark
be extensively described and interpreted. For instance, the dataset to perform their intended experiments and got an
first rule of Table12 states that a PDF sample is considered accuracy of 98.84% for the suggested approach. The study
malicious if it lacks the /XML feature, is not optimized, has a in [46] presented a dataset consisting of 10,025 PDF samples
file size greater than 1.41 kilobytes and less than or equal to based on evasive characteristics of PDF files. Moreover, the
46.52 kilobytes, a title length greater than 39 and less than or authors suggested an ensemble classifier based on stacking
equal to 63, contains a cross-reference table less than or equal learning which provided 98.69% accuracy to detect PDF
to 1.5 times, and does not have the /CreationDate feature. The malware. However, we notice that our work outperforms the
rule correctly identifies 5079 malicious PDF samples from existing works presented in Table 13, covering a large number
the dataset. of PDF samples with the use of three advanced tools for
feature extraction, deriving important features, and utilizing
the power of the Random Forest classifier to achieve a
much better accuracy of 99.24% for detecting PDF malware.
Furthermore, to the best of our knowledge, none of the
research presented in Table 13 gives a thorough human
interpretation of the classifier’s performance by illustrating a
decision tree and identifying decision rules for PDF malware
detection.

V. DISCUSSION
We created a dataset considering the malicious, clean, and
evasive PDFs to detect PDF malware. However, we take
FIGURE 25. Number of decision rules derived from decision trees of the only 792 evasive PDFs which is approximately 5% of
random forest classifier.
the entire dataset. The intuition of introducing the evasive
Similarly, the rest of the decision rules of Table 12 can PDFs is to reduce the bias of the classifier. Moreover, the
be explained and explained and interpreted for recognizing evasive characteristics of the PDFs make the classifier more
malicious PDFs. Nevertheless, these crucial decision rules robust in the detection of PDF malware. We maintained an
mentioned in Table 11 and Table 12 can significantly con- approximately balanced distribution between the benign and
tribute to a clear understanding of humans for categorizing malicious PDFs to overcome the problem of skewness of
benign and malicious PDFs. the classifier to a particular class. To analyze the PDFs and
Finally, to assess this study of PDF malware detection, extract the useful features, we used three well-known and
we perform a comparison with various existing works of the highly accurate tools PDFiD, PDFINFO, and PDF-PARSER.
same study discipline. Table 13 summarizes this comparative The idea of using these tools is to develop an effective
study, in which our work is evaluated from a variety of feature set for PDF malware detection by exploring multiple
perspectives, including the PDF sample source, the number efficient tools that ensure the acceptability of the extracted
of samples considered for the study, PDF labels, PDF analysis standard features of the PDFs. Additionally, we derived a
tools, the total number of PDF features considered for the few important features and merged them with the standard
study, the number of derived features developed for the features to generate a merged feature set. We built the final
analysis, the machine learning model used in the study, feature set by generating subsets from the merged feature set
the accuracy observed during the study, and whether the and assessed them utilizing the strength of the classifier.
study provides decision rules and human interpretation. We performed an in-depth experimental analysis of the
The authors of [1] described a method that used machine final feature set to explain the traits of the feature set.
learning classifiers to evaluate a given PDF both statistically We found the title length of the PDF files is crucial,
and interactively to identify the hazardous nature of the as malicious files tend to have an unusual length of the title
document. They ran their trials on 1200 PDF samples compared to clean files. Furthermore, metadata and structural
with PhoneyPDF (a PDF analysis tool) and discovered that features such as /Producer, /ProcSet, /ID, /CreationDate
the Random Forest classifier was the best fit to detect etc. are frequently observed inside the clean PDFs and
malicious PDFs, with an accuracy of 98.6%. Similarly, the seldomly found within the harmful ones. Attackers usually
authors in [33] implemented the Random Forest classifier to keep fraudulent files as small as possible by restricting the
identify malicious PDFs. They only used 1000 PDF samples contents, pages, fonts, and size with the intuition of carrying
in their pilot investigation and employed the PDFiD and out their attacks as swiftly as possible. We discovered that

VOLUME 12, 2024 13855


G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

TABLE 11. Top decision rules extracted from random forest classifier to detect clean PDF.

cyber criminal’s primary intention is to insert malice-related the strength of the classifier can be enhanced to combat
contents such as inserting JavaScript code, OpenAction files, modern advanced attacks more precisely if we can include
etc. within the structure of the PDFs to harm the victim’s more evasive PDF properties through careful inspection.
systems. We extracted characteristics in this experiment using three
We provided an explicit interpretation and explanation of tools: PDFiD, PDFINFO, and PDF-PARSER. These tools
the classifier’s performance by generating a decision tree while very popular, are known to have vulnerabilities (for
from one of the classifier’s estimators, as well as highlighting instance, PDFiD tool) to some attacks. One such attack is
a few critical decision rules for recognizing malicious and known as the parser confusion attack where the fraudulent
clean PDFs. We discovered some strong decision rules for material is disguised and concealed using a variety of
recognizing both types of PDFs that provide up to 100% approaches to avoid detection while retaining the ability
confidence and can identify a large number of samples; to execute and exploit. Also, run-time and other dynamic
nevertheless, we noticed a number of rules that require a characteristics may be leveraged to further investigate
significant number of constraints to be verified, making them questionable documents. We intend to address each of these
somewhat less effective. In addition to that, these weak rules constraints in our future work. Additional analysis can be
can accurately identify a small number of samples yet yield performed by combining aspects from various parsers and
high confidence. However, the decision rules offer a clear analysis techniques to investigate complicated content such
understanding and interpretation of how the features can be as JavaScript code.
utilized to detect PDF malware. Furthermore, the present feature set is derived from three
In this study, we added evasive behaviors to our experimen- extraction methods, and the features employed by the three
tal dataset to make our classifier more resilient. Nevertheless, programs depend on heuristics and insights made by their

13856 VOLUME 12, 2024


G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

TABLE 12. Top decision rules extracted from random forest classifier to detect malicious PDF.

TABLE 13. Comparison of our work with various existing studies for PDF malware detection.

developers. A greater comprehensive feature set can be scenarios or simulations to justify its practical effectiveness in
added by incorporating new sources, such as malicious identifying various types of PDF malware. Besides, we intend
document generation tools or in-depth study of malicious to investigate certificateless signcryption [48] and proxy
PDF documents. One such analysis can be to consider the signcryption [49] as advanced strategies for safeguarding
internal text of malicious PDFs where the attackers can PDFs which can add additional layers of security for
hide their harmful code segment behind the text content. PDFs that could potentially mitigate the risks posed by
Also, we want to assess the generalizability of our suggested PDF malware and leading the way for future research that
method against multiple types of PDF malware by investigat- integrates cryptographic techniques with malware detection.
ing how the model performs against different types of PDF In addition to that, adversarial PDF malware still poses a great
malware, including newer or more advanced variations. Plus, threat to a secure cyberspace. In the future, to combat such
we want to implement the proposed method in real-world threats we want to develop a data-driven intelligent approach

VOLUME 12, 2024 13857


G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

that can tackle adversarial PDF malware effectively. Besides, [13] S. Atkinson, G. Carr, C. Shaw, and S. Zargari, ‘‘Drone forensics: The
we want to publish an additional dataset comprised solely of impact and challenges,’’ in Digital Forensic Investigation of Internet of
Things (IoT) Devices. Cham, Switzerland: Springer, 2021, pp. 65–124.
evasive PDF samples covering a wide range of approaches to [14] C. Liu, C. Lou, M. Yu, S. M. Yiu, K. P. Chow, G. Li, J. Jiang, and W. Huang,
cyber attacks. ‘‘A novel adversarial example detection method for malicious PDFs using
multiple mutated classifiers,’’ Forensic Sci. Int., Digit. Invest., vol. 38,
Oct. 2021, Art. no. 301124.
VI. CONCLUSION [15] Q. A. Al-Haija and A. Ishtaiwi, ‘‘Machine learning based model to identify
In this study, we performed an extensive analysis for firewall decisions to improve cyber-defense,’’ Int. J. Adv. Sci., Eng. Inf.
PDF malware detection. For this, we first developed a Technol., vol. 11, no. 4, p. 1688, Aug. 2021.
[16] D. Stevens. (2023). PDFid (Version 0.2.8). [Online]. Available:
comprehensive dataset of 15958 PDF samples by taking into https://blog.didierstevens.com/programs/pdf-tools
account the non-malicious, malicious, and evasive natures of [17] PDF-Info. (2021). PDF-Info (Version 2.1.0). [Online]. Available:
the PDF samples. We also developed a method to generate an https://pypi.org/project/pdf-info/
effective and explainable feature set by extracting important [18] D. Stevens. (2023). PDF-Parser (Version 0.7.8). [Online]. Available:
https://blog.didierstevens.com/programs/pdf-tools
traits from our freshly constructed dataset’s PDF samples [19] M. Yu, J. Jiang, G. Li, C. Lou, Y. Liu, C. Liu, and W. Huang, ‘‘Malicious
using multiple PDF analysis tools. Further, we also derived documents detection for business process management based on multi-
features that are empirically demonstrated to be useful for layer abstract model,’’ Future Gener. Comput. Syst., vol. 99, pp. 517–526,
Oct. 2019.
classifying PDF malware. We investigated different machine [20] H. Pareek, P. Eswari, N. S. C. Babu, and C. Bangalore, ‘‘Entropy and n-
learning classifiers and highlighted the effectiveness of the gram analysis of malicious pdf documents,’’ Int. J. Eng., vol. 2, no. 2,
Random Forest model not only for performance comparison pp. 1–3, 2013.
but also for the explainability analysis with generating [21] C. Smutz and A. Stavrou, ‘‘Malicious PDF detection using metadata and
structural features,’’ in Proc. 28th Annu. Comput. Secur. Appl. Conf.,
decision rules. Moreover, we clarified the behaviors of the Dec. 2012, pp. 239–248.
characteristics in charge of detecting PDF malware and [22] D. Maiorca, G. Giacinto, and I. Corona, ‘‘A pattern recognition system
pointed out a few relevant observations that may aid in the for malicious pdf files detection,’’ in Proc. Int. Workshop Mach. Learn.
Data Mining Pattern Recognit. Cham, Switzerland: Springer, 2012,
detection of hazardous PDF files. Finally, we compared our pp. 510–524.
findings to several state-of-the-art research and highlighted [23] H. Pareek, ‘‘Malicious pdf document detection based on feature extraction
some key observations of our study. and entropy,’’ Int. J. Secur., Privacy Trust Manage., vol. 2, no. 5, pp. 31–35,
Oct. 2013.
[24] D. Maiorca, D. Ariu, I. Corona, and G. Giacinto, ‘‘A structural and content-
REFERENCES based approach for a precise and robust detection of malicious PDF files,’’
[1] S. S. Alshamrani, ‘‘Design and analysis of machine learning based in Proc. Int. Conf. Inf. Syst. Secur. Privacy (ICISSP), Feb. 2015, pp. 27–36.
technique for malware identification and classification of portable [25] N. Šrndić and P. Laskov, ‘‘Hidost: A static machine-learning-based
document format files,’’ Secur. Commun. Netw., vol. 2022, pp. 1–10, detector of malicious files,’’ EURASIP J. Inf. Secur., vol. 2016, no. 1,
Sep. 2022. pp. 1–20, Dec. 2016.
[2] P. Singh, S. Tapaswi, and S. Gupta, ‘‘Malware detection in PDF and office [26] P. Laskov and N. Šrndić, ‘‘Static detection of malicious JavaScript-
documents: A survey,’’ Inf. Secur. J., Global Perspective, vol. 29, no. 3, bearing PDF documents,’’ in Proc. 27th Annu. Comput. Secur. Appl. Conf.,
pp. 134–153, May 2020. Dec. 2011, pp. 373–382.
[3] N. Livathinos, C. Berrospi, M. Lysak, V. Kuropiatnyk, A. Nassar, [27] Z. Tzermias, G. Sykiotakis, M. Polychronakis, and E. P. Markatos,
A. Carvalho, M. Dolfi, C. Auer, K. Dinkla, and P. Staar, ‘‘Robust PDF ‘‘Combining static and dynamic analysis for the detection of malicious
document conversion using recurrent neural networks,’’ in Proc. AAAI documents,’’ in Proc. 4th Eur. Workshop Syst. Secur., Apr. 2011, pp. 1–6.
Conf. Artif. Intell., vol. 35, no. 17, 2021, pp. 15137–15145. [28] C. Vatamanu, D. Gavrilut, and R. Benchea, ‘‘A practical approach on
[4] Q. A. Al-Haija, A. Odeh, and H. Qattous, ‘‘PDF malware detection based clustering malicious PDF documents,’’ J. Comput. Virol., vol. 8, no. 4,
on optimizable decision trees,’’ Electronics, vol. 11, no. 19, p. 3142, pp. 151–163, Nov. 2012.
Sep. 2022. [29] F. Schmitt, J. Gassen, and E. Gerhards-Padilla, ‘‘PDF scrutinizer:
[5] Y. Wiseman, ‘‘Efficient embedded images in portable document format,’’ Detecting JavaScript-based attacks in PDF documents,’’ in Proc. 10th
Int. J., vol. 124, pp. 38–129, Jan. 2019. Annu. Int. Conf. Privacy, Secur. Trust, Jul. 2012, pp. 104–111.
[6] M. Ijaz, M. H. Durad, and M. Ismail, ‘‘Static and dynamic malware analysis [30] S. Karademir, T. Dean, and S. Leblanc, ‘‘Using clone detection to find
using machine learning,’’ in Proc. 16th Int. Bhurban Conf. Appl. Sci. malware in acrobat files,’’ in Proc. Conf. Center Adv. Stud. Collaborative
Technol. (IBCAST), Jan. 2019, pp. 687–691. Res., 2013, pp. 70–80.
[7] Y. Alosefer, ‘‘Analysing web-based malware behaviour through client [31] X. Lu, J. Zhuge, R. Wang, Y. Cao, and Y. Chen, ‘‘De-obfuscation and
honeypots,’’ Ph.D. dissertation, School Comput. Sci. Inform., Cardiff detection of malicious PDF files with high accuracy,’’ in Proc. 46th Hawaii
Univ., Cardiff, Wales, U.K., 2012. Int. Conf. Syst. Sci., Jan. 2013, pp. 4890–4899.
[8] N. Idika and A. P. Mathur, ‘‘A survey of malware detection techniques,’’ [32] I. Corona, D. Maiorca, D. Ariu, and G. Giacinto, ‘‘Lux0R: Detection of
Purdue Univ., vol. 48, no. 2, pp. 32–46, 2007. malicious PDF-embedded Javascript code through discriminant analysis
[9] M. Abdelsalam, M. Gupta, and S. Mittal, ‘‘Artificial intelligence assisted of API references,’’ in Proc. Workshop Artif. Intell. Secur. Workshop,
malware analysis,’’ in Proc. ACM Workshop Secure Trustworthy Cyber- Nov. 2014, pp. 47–57.
Phys. Syst., Apr. 2021, pp. 75–77. [33] A. Falah, L. Pan, S. Huda, S. R. Pokhrel, and A. Anwar, ‘‘Improving mali-
[10] W. Wang, Y. Shang, Y. He, Y. Li, and J. Liu, ‘‘BotMark: Automated cious PDF classifier with feature engineering: A data-driven approach,’’
botnet detection with hybrid analysis of flow-based and graph-based traffic Future Gener. Comput. Syst., vol. 115, pp. 314–326, Feb. 2021.
behaviors,’’ Inf. Sci., vol. 511, pp. 284–296, Feb. 2020. [34] Virustotal. Accessed: Jun. 18, 2023. [Online]. Available: https://
[11] N. Srndic and P. Laskov, ‘‘Practical evasion of a learning-based classifier: www.virustotal.com/gui/home/upload
A case study,’’ in Proc. IEEE Symp. Secur. Privacy, May 2014, [35] A. Kang, Y.-S. Jeong, S. Kim, and J. Woo, ‘‘Malicious PDF detection
pp. 197–211. model against adversarial attack built from benign PDF containing
[12] D. Maiorca, I. Corona, and G. Giacinto, ‘‘Looking at the bag is not enough Javascript,’’ Appl. Sci., vol. 9, no. 22, p. 4764, Nov. 2019.
to find the bomb: An evasion of structural methods for malicious PDF [36] D. Maiorca and B. Biggio, ‘‘Digital investigation of PDF files: Unveiling
files detection,’’ in Proc. 8th ACM SIGSAC Symp. Inf., Comput. Commun. traces of embedded malware,’’ IEEE Secur. Privacy, vol. 17, no. 1,
Secur., May 2013, pp. 119–130. pp. 63–71, Jan. 2019.

13858 VOLUME 12, 2024


G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

[37] N. Nissim, A. Cohen, C. Glezer, and Y. Elovici, ‘‘Detection of malicious KAUSHIK DEB received the B.Tech. and M.Tech.
PDF files and directions for enhancements: A state-of–the art survey,’’ degrees from the Department of Computer Science
Comput. Secur., vol. 48, pp. 246–266, Feb. 2015. and Engineering, Tula State University, Tula,
[38] M. Xu and T. Kim, ‘‘$PlatPal$: Detecting malicious documents with Russia, in 1999 and 2000, respectively, and
platform diversity,’’ in Proc. 26th USENIX Secur. Symp. (USENIX Secur.), the Ph.D. degree in electrical engineering and
2017, pp. 271–287. information systems from the University of Ulsan,
[39] Y. Chen, S. Wang, D. She, and S. Jana, ‘‘On training robust $PDF$ Ulsan, South Korea, in 2011. Since 2001, he has
malware classifiers,’’ in Proc. 29th USENIX Secur. Symp. (USENIX Secur.), been a Faculty Member of the Department of
2020, pp. 2343–2360.
Computer Science and Engineering (CSE), Chit-
[40] C. Smutz and A. Stavrou, ‘‘When a tree falls: Using diversity in ensemble
tagong University of Engineering and Technology
classifiers to identify evasion in malware detectors,’’ in Proc. Netw. Distrib.
Syst. Secur. Symp., 2016, pp. 1–15. (CUET), Chattogram, Bangladesh, where he is currently a Professor with the
[41] M. Li, Y. Liu, M. Yu, G. Li, Y. Wang, and C. Liu, ‘‘FEPDF: A Department of CSE. Moreover, he was in various administrative positions
robust feature extractor for malicious PDF detection,’’ in Proc. IEEE with CUET, such as the Dean of the Faculty of Electrical and Computer
Trustcom/BigDataSE/ICESS, Aug. 2017, pp. 218–224. Engineering (ECE), from 2017 to 2019, the Director of the Institute of
[42] D. Liu, H. Wang, and A. Stavrou, ‘‘Detecting malicious Javascript in PDF Information and Communication Technology (IICT), from 2015 to 2017,
through document instrumentation,’’ in Proc. 44th Annu. IEEE/IFIP Int. and the Head of the CSE Department, from 2012 to 2015. He made a
Conf. Dependable Syst. Netw., Jun. 2014, pp. 100–111. variety of contributions to managing and organizing conferences, workshops,
[43] N. Šrndic and P. Laskov, ‘‘Detection of malicious pdf files based on and other academic gatherings. He has published more than 110 technical
hierarchical document structure,’’ in Proc. 20th Annu. Netw. & Distrib. articles with peer reviews. His research interests include computer vision,
Syst. Secur. Symp., 2013, pp. 1–16. deep learning, pattern recognition, intelligent transportation systems (ITSs),
[44] Canadian Institute for Cybersecurity (CIC). (2022). PDF dataset: and human–computer interaction. He was a Steering Member. He acted
CIC-Evasive-PDFMAL2022. [Online]. Available: https://www.unb. as the Chair or a Secretary in a variety of international and national
ca/cic/datasets/pdfmal-2022.html conferences, such as the International Conference on Electrical, Computer,
[45] (2013). Contaigo, 16,800 Clean and 11,960 Malicious Files for Sig- and Communication Engineering (ECCE), the International Forum on
nature Testing and Research. [Online]. Available: http://contagiodump. Strategic Technology (IFOST), the International Workshops on Human
blogspot.com/2013/03/16800-clean-and-11960-malicious-files.html
System Interactions (HSI), and the National Conference on Intelligent
[46] M. Issakhani, P. Victor, A. Tekeoglu, and A. Lashkari, ‘‘PDF malware
Computing and Information Technology (NCICIT).
detection based on stacking learning,’’ in Proc. 8th Int. Conf. Inf. Syst.
Secur. Privacy, 2022, pp. 562–570.
[47] E. Frank, M. A. Hall, and I. H. Witten, ‘‘Data mining: Practical
machine learning tools and techniques,’’ in The WEKA Workbench,
4th ed. San Mateo, CA, USA: Morgan Kaufmann, 2016. [Online].
HELGE JANICKE received the Ph.D. degree from
Available:http://www.cs.waikato.ac.nz/ml/weka/book.html De Montfort University (DMU), U.K., in 2007.
[48] I. Ullah, N. Ul Amin, M. Zareei, A. Zeb, H. Khattak, A. Khan, He is currently a Professor in cybersecurity with
and S. Goudarzi, ‘‘A lightweight and provable secured certificate- Edith Cowan University (ECU), Australia. He is
less signcryption approach for crowdsourced IIoT applications,’’ Sym- also the Director of the Security Research Institute,
metry, vol. 11, no. 11, p. 1386, Nov. 2019. [Online]. Available: ECU, and the Research Director for Australia’s
https://www.mdpi.com/2073-8994/11/11/1386 Cyber Security Cooperative Research Centre.
[49] A. Waheed, A. I. Umar, M. Zareei, N. Din, N. U. Amin, J. Iqbal, He established DMU’s Cyber Technology Insti-
Y. Saeed, and E. M. Mohamed, ‘‘Cryptanalysis and improvement of a tute, DMU, and its Airbus Centre of Excellence
proxy signcryption scheme in the standard computational model,’’ IEEE for SCADA cybersecurity and digital forensics
Access, vol. 8, pp. 131188–131201, 2020. research, and heading up DMU’s School of Computer Science. His research
interests include cybersecurity in critical infrastructure, human factors of
cybersecurity, the cybersecurity of emerging technologies, digital twins, and
the Industrial IoT.

IQBAL H. SARKER (Member, IEEE) received


the Ph.D. degree in computer science from the
Swinburne University of Technology, Melbourne,
Australia, in 2018. He is currently a Research Fel-
low of the Cyber Security Cooperative Research
Centre (CRC) in association with the ECU Secu-
rity Research Institute, Edith Cowan University
(ECU), Australia. His research interests include
cybersecurity, AI/XAI and machine learning, data
science and behavioral analytics, digital twin,
smart city applications, and critical infrastructure security. He has published
more than 100 journals and conference papers in various reputed venues
G. M. SAKHAWAT HOSSAIN received the published by Elsevier, Springer Nature, IEEE, ACM, and Oxford University
Bachelor of Science degree in computer science Press. Moreover, he is a lead author of a research monograph book titled
and engineering from the Rajshahi University Context-Aware Machine Learning and Mobile Data Analytics (Springer
of Engineering and Technology. He is currently Nature, Switzerland, 2021). He has also been listed in the world’s top 2%
pursuing the Master of Science degree in computer of most-cited scientists, published by Elsevier & Stanford University, USA.
science and engineering with the Chittagong In addition to research work and publications, he is also involved in a number
University of Engineering and Technology. He is of research engagement and leadership roles, such as journal editorial,
also a Lecturer with Rangamati Science and international conference program committee (PC), student supervision,
Technology University, Chattogram. His research visiting scholar, and national/international collaboration. He is a member of
interests include malware analysis, natural lan- ACM and Australian Information Security Association.
guage processing, computer vision, and machine learning.

