PDF Malware Detection Toward Machine Learning Modeling With Explainability Analysis
ABSTRACT The Portable Document Format (PDF) is one of the most widely used file types; thus, fraudsters
insert harmful code into victims’ PDF documents to compromise their equipment. Conventional solutions
and identification techniques are often insufficient and may only partially prevent PDF malware because of
the versatile character of PDF malware and their excessive dependence on a certain typical feature set. The primary goal of this
work is to detect PDF malware efficiently in order to alleviate the current difficulties. To accomplish the goal,
we first develop a comprehensive dataset of 15958 PDF samples taking into account the non-malevolent,
malicious, and evasive behaviors of the PDF samples. Using three well-known PDF analysis tools (PDFiD,
PDFINFO, and PDF-PARSER), we extract significant characteristics from the PDF samples of our newly
created dataset. In addition, we generate a number of derivations of features that have been experimentally
proven to be helpful in classifying PDF malware. We develop a method to build an efficient and explicable
feature set through the proper empirical analysis of the extracted and derived features. We explore different
baseline machine learning classifiers and demonstrate an accuracy improvement of approx. 2% for the
Random Forest classifier utilizing the selected feature set. Furthermore, we demonstrate the model’s
explainability by creating a decision tree that generates rules for human interpretation. Eventually, we make
a comparison with previous studies and point out some important findings.
INDEX TERMS Cybersecurity, PDF malware, data analytics, machine learning, decision rule, explainable
AI, human interpretation.
intelligence and other techniques, the behavior-based strategy can recognize unidentified and sophisticated malware to some extent, although it is a complex process.

The fundamental structure of a PDF, which contains a header, body, xref (cross-reference) table, and trailer, is illustrated in Fig. 1 [3]. The PDF’s header indicates which version of the parser format will be used. Text blocks, typefaces, file-specific metadata, and images are all included in the PDF’s body, which also specifies its content [4]. There are four categories into which the contents of a PDF can be placed: numbers, strings, streams, and booleans [5]. Each item in the PDF file has an entry in the cross-reference table that details its byte offset or placement in the file as well as enables speedy random access to particular objects, facilitating effective document exploration and content retrieval. A PDF reader or parser may traverse and access the different items within the file by using the trailer, which gives them all the necessary information such as PDF size, root object, metadata info, encryption info, and the unique identifier of the PDF file. PDF malware can usually be created by injecting malicious content or programs into the elements of this fundamental structure.

FIGURE 1. A sample structure of PDF.

PDF malware can be analyzed using a static, dynamic, or hybrid approach [6]. The static technique inspects malware without executing the program that it embeds, but the dynamic method inspects malware by executing its code [7], [8]. Static analysis becomes susceptible when extensive evasion and fraudulent methods are used to disguise harmful execution behavior. In the present cybersecurity circumstances, depending solely on static inspection is often inadequate, since a perpetrator who is dedicated to their attack would disguise and encode their code, making it normally invisible to static inspection [4]. Dynamic techniques, on the contrary, are more resilient to code deception, causing them to be a better defense against advanced viruses [9]. Dynamic analysis is often slow and challenging, but static analysis tends to be fast. Integrating the two approaches results in hybrid analysis, which is more effective in combating advanced malware than either method alone but additionally consumes a longer period and necessitates an additional complex analysis method [10].

Current malware identification methods frequently choose feature sets according to findings from a manual inspection of harmful PDF files and are guided by the expertise of the specialist. The chosen features, nevertheless, are occasionally exclusive to fraudulent files, luring adversaries to possibly gain authority over how and what a malicious file looks like, and evading the current detectors (while preserving their malicious properties). For instance, the Mimicry [11] and Reverse Mimicry [12] incidents have been exacerbated by the observation that the program builders infrequently disclose comprehensive information about the measures adopted to maintain their integrity and resistance to risks. In addition, the data that is accessible to developers, such as clean and harmful samples, datasets, vulnerabilities, payloads, and attack vectors employed within, also constrains their work. Such situations result in the produced solutions becoming outdated considerably earlier than what the diligent developers had planned.

Machine learning applications have advanced to the point where they can now protect systems from threats or aid forensic professionals in their investigations by spotting likely malicious PDF files [13]. However, adversarial techniques have grown capable of compromising threat document analyzers. Numerous machine-learning-based detection tools are at risk because their identification of well-crafted evasive scenarios may be erroneous [14], [15]. Various evaluations or detection methods have been created to detect specific incidents, but the immediate threat posed by evasive attacks has not yet been mitigated.

Developing feature engineering improvements integrating the adversarial behaviors of malicious PDFs for creating harmful PDF classifiers is challenging, yet necessary, and has an opportunity to have a significant impact in the field. We look at ways to improve the identification approach for PDF malware by 1) introducing an inclusive dataset that contains evasive characteristics of suspicious PDFs along with clean and harmful PDF samples, 2) extracting the features of the PDF samples, and 3) merging the most significant features to develop an effective feature set that can be fed into a classifier to produce a higher level of accuracy. We provide a comprehensive analysis of the most significant features identified for PDF malware detection and interpret the classifier’s performance. In summary, our contributions can be outlined as follows:
• We have developed a comprehensive dataset that consists of a total of 15958 PDF samples including 7500 clean PDFs, 7666 malicious PDFs, and 792 evasive PDFs by considering the non-malicious, malicious, and evasive natures of the PDF samples. For this, we use three popular PDF analysis tools viz. PDFiD [16], PDFINFO [17], and PDF-PARSER [18].
• We develop a method to build an explicable feature set by taking into account the feature’s characteristics and importance score.
• We have designed an architecture for malicious PDF detection and explored different machine learning classifiers to analyze and compare their efficacy in different cases.
• We have demonstrated the model’s explainability by creating a decision tree that generates rules for human interpretation.
• We have conducted a wide range of experimental analyses and compared our results to previous studies. We also highlight some key observations of our study.

The rest of the paper is organized as follows: Section II provides an in-depth and organized overview of current research in the same field of study. Section III describes the recommended approach for PDF malware detection. Section IV presents and evaluates our findings, as well as the experimental outcomes. Section V provides an in-depth discussion while pointing out a few observations. Finally, closing remarks are offered in Section VI.

II. RELATED WORKS
In recent years, PDFs have been widely used to disseminate malicious documents and malware. To mitigate the subsequent and crucial growth of malicious PDF developments, numerous effective studies on detecting and categorizing technologies for malware and other dangerous files were established [19]. The tools that have been designed throughout the past years range greatly from being general and straightforward to specific and complex. Certain techniques try to find differences by scanning the whole file [20]. An additional kind of technique searches an intended file for resemblance to typical trends found in harmful PDF files [21], [22], [23], [24], [25]. Another set of tools concentrated on extracting, analyzing, and identifying attack methods, for instance, detecting JavaScript-based attacks [26], [27], [28], [29], [30], [31], [32]. Most of these approaches are heavily reliant on machine learning methods, including one- and two-class Support Vector Machines, Random Forests, and decision trees.

The study in [33] focused on developing an approach to recognize a group of features derived through currently available tools as well as generated a new group of features aimed at improving PDF maldoc identification and prolonging the useful life of current analysis and detection techniques. The importance of the produced features was assessed using a wrapper function that leveraged three key supervised learning methods as well as a feed-forward deep neural network. Subsequently, a novel classifier that significantly improved classification efficacy with shorter training times was constructed deploying features of the highest significance. With the use of huge datasets from VirusTotal [34], the findings were verified.

From top to bottom, authors in [35] looked into PDF design and JavaScript content contained in PDFs. They developed a wide range of features for design and metadata, including the number of bytes per second, the encoding method, catchphrases, object names, and intelligible strings in JavaScript. Additionally, since subtle changes have a significant impact on AI calculations, it is challenging to develop hostile models when the attributes vary. To reduce the risk of malicious attacks while maintaining structures and data properties, they developed a classification model using discovery-type models. They created an adversarial attack in order to assess the suggested paradigm.

An outline of the PDF was provided in [36], and contemporary attacks on PDF malware were carried out using reliable attack models obtained from nature. They gave an example of how to use programming skills to perform a quantitative analysis of a PDF file to look for signs of contained malware. They looked at some of the emerging AI-powered tools for detecting PDF malware which may assist computational scientific analyses and can flag questionable documents before a more thorough, more conclusive statistical analysis is published. They looked at the PDF restrictions alongside various unresolved problems, especially how their flaws might be used to potentially misdirect measured investigations. Finally, they offered advice on how to make those structures more effective in withstanding attacks and sketched a possible assessment.

Obfuscation strategies used by PDF maldoc authors were noted by the study in [37]; these techniques hinder automated evaluation and identification methods and make manual analysis more difficult. This involves exploiting PDF filters, comments, and white space to spread harmful code across numerous objects. Other strategies include gathering harmful code fragments strewn throughout the page using a "Names" dictionary. Furthermore, hazardous substances can be concealed in odd places like document metadata or the fields (comments) of annotations. Moreover, memory spraying and the use of shellcodes to download malicious files or documents were included in the study of [37] for the classification of PDF-based attacks as JavaScript code exploits.

Because a benign PDF document acts identically on several devices, the authors in [38] developed a detection method based on behavioral inconsistencies on those platforms using a software engineering idea. On the other hand, a malicious document will behave differently depending on the platform. The study in [39] emphasized malware inserted into PDF files as a representative example of contemporary cyberattacks. They began by classifying the various production processes for PDF malware scientifically. They used a proven adversarial AI framework to counter PDF malware detectors that rely on learning. This strategy, for instance, made it possible to discover existing faults in learning-oriented PDF malware trackers as well as novel threats that may threaten such architectures, as well as the likelihood of protective actions.

In [40], the authors outlined an innovative approach to detect data problems of an ensemble classifier. The ensemble classifier’s prediction was shown to be false when enough individual classifier votes clashed during detection. The recommended method, ensemble classifier consensus evaluation, facilitated the findings of various sorts of system evasions without the necessity for additional external ground truth. The authors tested the suggested approach using PDFrate, a PDF malware detector, and revealed that a significant number of assumptions could be derived utilizing improved ensemble classifier concordance using the entire network’s data.

The authors of [25] demonstrated how the least optimistic case behavior of a malware detector in terms of specified intensity features could be examined. Additionally, they discovered that creating classifiers with legally verified efficient features may raise the expense of avoiding unrestrained attackers by simply skipping over simple assault avoidance techniques. They put forth an alternative distance measure that relies on the tree structure of PDF and identified two groups of strong features, such as erasures and subtree inclusions.

In [32], the researchers presented Lux0R, further referred to as "Lux 0n discriminant References," a novel and adaptable approach for detecting malicious code in JavaScript. The recommended strategy hinged on describing code in JavaScript using API references, which contained elements that a JavaScript Application Programming Interface (API) can intuitively comprehend, such as objects, constants, functions, attributes, methods, and keywords. To isolate suggestive risky code of a certain subgroup from API references, the proposed approach made use of machine learning, which was subsequently used to spot JavaScript malware. The important application domain that the authors focused on in this work was the detection of potentially harmful JavaScript code in PDF files.

The weaknesses within existent extractors of features for PDFs were uncovered by the authors of [41] by evaluating them alongside analyzing how the framework of the fraudulent documents was set up. The researchers subsequently developed FEPDF (feature extractor-PDF), a sophisticated feature extractor, that was capable of discovering characteristics that traditional extraction methods could lose and recorded accurate data concerning the PDF components. To investigate the most recent antivirus frameworks along with pattern extractors, the authors created numerous fresh harmful PDFs as samples. The results indicate that a number of existing antivirus applications were unable to identify the fresh dangerous PDFs; however, FEPDF was able to retrieve the essential components for improved dangerous PDF classification.

In [42], an integrated detection technique was suggested to track the JavaScript code’s runtime behavior along with the recognition of features related to obfuscation, which included concealing certain keywords’ presence with ASCII hexadecimal when several compression filters were used and the existence of any void objects. The research in [43] was based on the odd disparities between harmful and clean document construction. The authors adopted the tools that extracted the feature set utilizing the document hierarchy or structure path. A tree was constructed from the document hierarchy, and on the basis of the presence of specific paths the harmful and clean files were identified, according to the authors.

However, to combat the existing threats caused by PDF malware as well as to mitigate the challenges posed by the evasive behavior of PDFs, we certainly require an effective classifier that works with an explainable feature set covering the wide range of behaviors of PDFs. In this research, we developed a dataset that covers the characteristics of clean and harmful PDFs along with a limited introduction to the evasive behaviors of PDFs. Moreover, we identified an explainable feature set by extracting useful features from the PDFs by adopting three well-known PDF analysis tools. Furthermore, we identified an effective machine learning classifier that leverages the newly developed feature set to detect PDF malware with improved accuracy. Finally, we provided a thorough explanation of the performance of the classifier by describing a decision tree built from one of the estimators of the classifier and extracting a few crucial decision rules for detecting PDF malware effectively. In the subsequent section, the details of the methodology used in this research will be discussed thoroughly.

III. METHODOLOGY
PDF files are among the most extensively used file types in the world. However, hackers can utilize PDF files, which are usually non-threatening, to introduce security dangers via malicious code, just as they can with PNG files, dot-com files, and Bitcoin [4]. As a result, PDF malware appears, demanding techniques for recognizing malicious from benign files. This section discusses the proposed detection system for analyzing and categorizing PDF files as benign or malicious.

Fig. 2 represents the inclusive graphical architecture of the proposed approach utilized to conduct this research. In our proposed approach, initially, we accumulate 29901 raw PDF samples from [44], which are originally picked from Contagio Data Dump [45] and VirusTotal [34]. Then, PDF samples are divided into Benign, Malicious, and Evasive categories according to their preassigned label as mentioned in [44] and [46]. Then, we choose 15958 PDF samples of the Benign, Malicious, and Evasive categories from the 29901 raw samples and develop a comprehensive dataset for our experimental study. We utilize three up-to-date PDF analysis tools viz. PDFiD [16], PDFINFO [17], and PDF-PARSER [18] to extract effective standard feature sets F1, F2, and F3 respectively from the raw PDF samples of our experimental dataset. In addition to the standard feature set, seven more features are derived by carefully observing the characteristics of PDF samples from the feature set F1. The standard feature sets F1, F2, and F3 are then implemented in the model selection phase, which includes a set of baseline machine learning classifiers. The model selection phase determines the best model amongst the baseline classifiers employed in this study based on their effectiveness for each feature set.
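As a rough illustration of the keyword-count style of features these tools produce, the sketch below counts structural keywords in raw PDF bytes. It is a simplified assumption-laden stand-in, not the tools' actual implementations; in particular, it deliberately ignores hex-escaped name obfuscation (e.g., /J#61vaScript), which PDFiD normalizes.

```python
import re

# Keywords counted in this sketch (a small subset of the 22 features
# described later in this paper).
KEYWORDS = [b"obj", b"endobj", b"stream", b"endstream",
            b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch"]

def count_keywords(raw_pdf_bytes):
    """Count structural keywords in raw PDF bytes.

    Simplified on purpose: real tools also normalize hex-escaped names,
    which plain pattern matching misses.
    """
    counts = {}
    for kw in KEYWORDS:
        # Guard with letter boundaries so 'obj' does not match inside 'endobj'
        # and 'stream' does not match inside 'endstream'.
        pattern = rb"(?<![A-Za-z])" + re.escape(kw) + rb"(?![A-Za-z])"
        counts[kw.decode()] = len(re.findall(pattern, raw_pdf_bytes))
    return counts
```

Each PDF in the corpus can then be mapped to one feature vector of such counts before model selection.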
The extracted derived features are then merged with the standard feature sets F1, F2, and F3 respectively, to develop the derived feature sets F1′, F2′, and F3′ correspondingly. Later, we aim to generate feature subsets from F1′, F2′, and F3′ based on their importance and employ them in the best-performing model to determine the most effective feature subsets. Finally, we execute a union operation on the three best subsets acquired from F1′, F2′, and F3′ to construct the final feature set used for malicious PDF detection. The performance of our proposed approach is measured using various performance metrics such as precision, recall, f1-measure, and accuracy. Furthermore, we highlight the impact of the final feature set and how much it contributes to the classification activities. In addition, leveraging the strength of the best-performing model, we offer an explanation to make it more humanly understandable by extracting some important decision rules responsible for the classification activities. The details of our proposed methodology for malicious PDF detection are described below in the following subsections.

1) PDF SAMPLE COLLECTION
To carry out our experiments, we gather a large corpus of raw PDF files from [44], which consists of 29901 PDFs. The original sources of the PDF files are two well-known sites, i.e., Contagio Data Dump [45] and VirusTotal [34]. Among the collected PDF files, we get 9109 Benign PDFs from Contagio Data Dump and 20000 malicious PDFs from VirusTotal. From [44], we also gather 792 evasive PDF files, among which 400 are labeled as benign evasive and 392 are labeled as malicious evasive. Table 1 shows the distribution of the collected PDF files with their sources and their preassigned label.

TABLE 1. Collected PDF files for the experimental study.
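The corpus bookkeeping described above can be sketched as follows. The counts are taken directly from the text; the dictionary keys are illustrative names, and the selection of the operational subset itself is the authors' procedure, not reproduced here.

```python
# Composition of the raw corpus gathered from Contagio Data Dump and
# VirusTotal, as reported in the text (labels come preassigned).
RAW_CORPUS = {
    "benign": 9109,            # Contagio Data Dump
    "malicious": 20000,        # VirusTotal
    "benign_evasive": 400,     # evasive subset of [44]
    "malicious_evasive": 392,  # evasive subset of [44]
}

# Operational dataset selected from the raw corpus for the experiments.
OPERATIONAL = {
    "benign": 7500,
    "malicious": 7666,
    "benign_evasive": 400,
    "malicious_evasive": 392,
}

def total(counts):
    """Sum the per-category sample counts."""
    return sum(counts.values())
```

The totals reconcile with the figures quoted in the text: 29901 raw samples and 15958 selected samples.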
much possible that a clean PDF file can be generated using including its title, author, creation date, and more. In our
the nonmalicious JavaScript feature. However, developing operational dataset, we have examined metadata features
a machine learning classifier solely based on both of these such as Filesize_kb, /ID, /CreationDate, /ModDate,
categories can lead to an overfitted model for malicious pages, etc.
PDF identification. Moreover, JavaScript is not only the • Triggering Features: Triggering features refer to partic-
key characteristic that malicious PDF exhibits rather there ular traits or components in a PDF file that can possibly
are other triggering features. For instance, OpenAction, and cause various behaviors or actions, including harmful
Aditional Action (AA) are some of the few features that may ones. Attackers might deploy these features to distribute
indicate potential malicious activity of a PDF [33]. Besides, malware, execute scripts, or perform other malicious
adding too many diversities for both of these categories acts. In our dataset, we carefully analyze triggering
can potentially skew the feature weights and importance features such as /JavaScript, /JS, /OpenAction, /AA
which may cause an ineffective classification accuracy (Aditional Action, /Launch, and so on.
for our proposed model. Considering the aforementioned
reasoning, thirdly, we thoughtfully select a few PDF files
that exhibit malicious behaviors but are labeled as benign 1) PDFiD FEATURES
(benign evasive) and the PDF files that pose the benign PDFiD is a Python-based tool [16] for scanning PDF
behavior but are labeled as malicious (malicious evasive). documents in order to discover specific features and traits that
In this pilot experiment, we concentrate on developing an may signal possible maliciousness. PDFiD does not run any
operational dataset that has a total of 15958 PDF files code inside the PDF; instead, it concentrates on examining
including 7500 benign (clean), 7666 malicious, 400 benign the parts and arrangement of the PDF to shed light on its
evasive, and 392 malicious evasive PDF files, by setting a characteristics. We have gone through all of our dataset’s
limited scope to ensure that we get reliable findings as quickly PDF files and used the PDFiD tool to extract 22 features,
as possible. Table 2 represents the dispersion of the selected as shown in Fig. 3. These extracted features are considered
PDF files for our experimental study. for the standard feature set, F1 . In the following, we describe
the features in brief:
TABLE 2. Selected PDF files for our operational dataset.
• PDF Header: The PDF header is required for appli-
cations and software to appropriately identify and
comprehend PDF documents. The ‘‘%PDF’’ identifi-
cation is followed by a version number in the PDF
header. For instance, ‘‘%PDF-1.3’’ denotes that the PDF
file complies with PDF standard version 1.3. This code
notifies applications and PDF viewers that the document
is in PDF format.
• obj: PDF documents are made up of objects such as
B. STANDARD FEATURE SET EXTRACTION fonts, text, images, forms, etc. The term obj refers to
We adopt three tools viz. PDFiD, PDFINFO, and PDF- the opening of an object definition. This feature provides
PARSER, which extract features from PDF files. Although the total number of obj keywords that can be identified
the three tools serve the same objective, the results they within the PDF structure.
produce are different and can be leveraged to create three • endobj: The term endobj specifies the closing of the
different feature sets. The feature set extracted by the tools object definition. In the case of PDFiD, this feature
can be categorized into the following groups: points out how many times the endobj keyword appears
• Content-Related Features: Content-related features inside the PDF structure.
obtained from the PDF file yield clues regarding the • stream: A stream object is employed in PDF documents
file’s textual and visual content. For instance, features to hold binary data, such as fonts, images, or other binary
like /Image, /Font, /ProcSet etc. are a few examples of material, within the document. This feature provides the
the content-related features, we observe in our dataset. number of stream keywords that exist within the PDF
• Structure Related Features: The structural feature refers file.
to the construction elements utilized to create a PDF • endstream: This keyword denotes the completion of the
document. This type of feature provides an internal stream’s binary data portion. In the context of PDFiD,
relationship and exhibits the hierarchy among vari- this feature indicates how many endstream keywords can
ous elements of PDFs. We have considered various be found within a PDF document.
structural features, for instance, obj, endobj, %EOF, • xref: The xref (cross reference) table assists in maintain-
startxref, trailer etc. in our operational dataset. ing links between the structured objects that are stored
• Metadata Features: Metadata features of a PDF file in PDF files. PDFiD provides the number of xref tables
provide valuable information about the file itself, that exist inside a PDF document.
• trailer: The trailer is the final component of the PDF • /JBIG2Decode: This feature reveals the number of
file and contains crucial details about the byte offset /JBIG2Decode keywords that exist within the structure
to the beginning of the cross-reference (XRef) table. of a PDF file. The feature explains whether the PDF
In the case of PDFiD, this feature indicates how many uses the JBIG2 compression or not, although it does
trailers can be found within the structure of a PDF not provide any direct indication of maliciousness but
file. requires further analysis.
• startxref: The startxref keyword designates the location • /RichMedia: The feature demonstrates the number of
where the Xref table of the PDF is started. This feature /RichMedia keywords that can be found within the PDF
yields how many times we can find startxref keyword structure that provides an indication of flash files.
inside a PDF. • /Launch: This outputs the number of /Launch keywords
• /Page: This feature indicates the total number of pages that exist within the PDF.
of a PDF. • /EmbeddedFIle: This indicates the number of /Embed-
• /Encrypt: The feature outputs the number of /Encrypt dedFIle keywords that can be found inside the structure
keywords present within the PDF structure. of a PDF.
• ObjStm: The total number of object streams is counted • /XFA: Certain PDF files contain XFAs, which are XML
with /ObjStm. The ObjStm possesses the ability to hold Form architectures that offer scripting capabilities that
other objects, making it useful for hiding things. can be abused by attackers. This feature outputs the
• /JS: The number of objects that contain the /JS keyword number of /XFA keywords that can be observed inside a
which reveals the objects having JavaScript code. PDF file.
• /JavaScript: This feature demonstrates the number of • /Colors: This feature indicates the number of different
objects containing JavaScript code, a common and colors utilized in the PDF structure.
prevalent obfuscation technique.
• /AA: This feature denotes the number of /AA 2) PDFINFO FEATURES
(Additional Action) keywords observed inside a PDF PDFINFO is a command-line program that is a part of the
document. Poppler utility suite commonly used for extracting metadata
• /OpenAction: When a page or document is viewed, from PDF files. We extract 14 features from our operational
an automated action is indicated by the /OpenAc- dataset utilizing the PDFINFO tool as depicted in Fig. 4.
tion command. This feature demonstrates how many These 14 features are considered as the standard feature set
/OpenAction keywords a PDF document has inside its F2 . In the following, we describe the features of the feature
structure. set F2 :
• /AcroForm: The feature denotes the number of /Acro- • Custom Metadata: This feature indicates the presence of
Form keywords that exist within a PDF file. The Acrobat user-defined custom metadata inside a PDF document.
forms used in PDF files can be exploited by the attackers. The feature provides the value as ‘yes’ or ‘no’.
the keyword has been frequently encountered inside the • /ModDate: Number of /ModDate entries that can be
parsed structures of clean PDF files. discovered inside the parsed structure of a PDF. The
• /ID: Number of /ID keywords that can be discovered modification date and time of the PDF file are specified
in the parsed structure of a PDF. This keyword reveals using the /ModDate keyword.
the document ID that is crucial for the integrity and • /Info: Number of /Info keywords that can be spotted
security of the document which can indicate whether the inside the parsed structure of a PDF. The term /Info
document was tampered with malicious activity or not.
• /S: Number of /S keywords that can be spotted in the parsed structure of a PDF. The keyword indicates the subtype of various objects or tasks, such as text or link annotations.
• /CreationDate: Number of /CreationDate keywords that can be discovered in the parsed structure of a PDF.
• obj: Number of objects that can be spotted inside the parsed structure of a PDF.
• xref: Number of xref keywords that can be observed within the parsed structure of a PDF.
• <<: Number of '<<' keywords that can be noticed in the parsed structure of a PDF. In a PDF file, '<<' signifies the start of a dictionary object.
• >>: Number of '>>' keywords that can be noticed in the parsed structure of a PDF. In a PDF file, '>>' signifies the closing of a dictionary object.
• /Font: Number of /Font entries that can be discovered inside the parsed structure of a PDF.
• /XObject: Number of /XObject keywords that can be observed within the parsed structure of a PDF. The /XObject keyword is utilized to indicate and encapsulate external graphical material such as images, forms, and other sophisticated objects.
describes the document's information dictionary.
• /XML: Number of /XML entries that can be discovered inside the parsed structure of a PDF.
• Comment: Number of comments that are noticed inside the parsed structure of a PDF.
• /Widget: Number of /Widget keywords that are found within the parsed structure of a PDF. The /Widget annotations are interactive components that are employed in PDF files, particularly PDF forms, which enable users to interact with input data.
• Referencing: Number of Referencing keywords that are noticed inside the parsed structure of a PDF.
• /FontDescriptor: Number of /FontDescriptor keywords that are discovered within the parsed structure of a PDF.
• /Image: Number of /Image keywords that can be found within the parsed structure of a PDF.
• /Rect: Number of /Rect keywords that are observed within the parsed structure of a PDF.
• /Length: Number of /Length keywords noticed within a PDF's parsed structure. The /Length keyword specifies the length or size of the content stream related to a PDF object in bytes.
• /Action: Number of /Action keywords noticed within a PDF's parsed structure.
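Keyword counts of the kind listed above can be gathered with a simple static scan of the raw file bytes. The sketch below is a naive PDFiD-style counter over an illustrative subset of the keywords; it is not the tools' actual implementation (real tools also normalize hex-escaped names such as /J#61vaScript and respect object boundaries).

```python
import re

# An illustrative subset of the keywords listed above; counts are taken
# over the raw bytes of the PDF file.
KEYWORDS = [b"obj", b"xref", b"/Font", b"/XObject", b"/Widget",
            b"/FontDescriptor", b"/Rect", b"/Length", b"/Action",
            b"<<", b">>"]

def count_keywords(raw: bytes) -> dict:
    """Naive substring counts. Note: 'obj' also matches inside 'endobj',
    and hex-escaped names (e.g. /J#61vaScript) are not normalized here."""
    return {kw.decode("latin-1"): len(re.findall(re.escape(kw), raw))
            for kw in KEYWORDS}

sample = b"1 0 obj << /Type /Page /Font 2 0 R >> endobj\nxref\n"
counts = count_keywords(sample)
```

The resulting dictionary maps each keyword to its occurrence count and can serve as one row of the feature matrix.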
Algorithm 1 Steps For Final Feature Set Generation
Input: Derived Feature Sets, F′ = {F1′, F2′, F3′}
Output: Final Subset containing the union of best subsets
1: for f in F′ do
2:   Find the importance of features in f
3:   Sort the features in f based on importance
4:   Generate subset S = {S1, S2, S3, ...} by taking top features from f
5:   Initialize best_subset as an empty set
6:   for s in S do
7:     Apply s on the Best ML Model Selected for PDF Malware Detection
8:     Generate Classification Report
9:     if subset performance is better than best_subset performance then
10:      Set best_subset to s
11:    end if
12:   end for
13:   Perform a union operation to append best_subset to Final_Subset
14: end for

identifying PDF malware? and 2) What is the suitable machine learning model for PDF malware detection?
• Case II: In this scenario, our key focus is to uncover the findings of the following queries: 1) What is the impact of the derived feature sets in identifying PDF malware? and 2) How do the derived feature sets F1′, F2′, and F3′ contribute to the detection of PDF malware?
• Case III: In this case, we investigate the answers to the following questions: 1) What features are selected for the final feature set? and 2) Does the final feature set boost the performance of the classifier?
• Case IV: In this particular circumstance, we look into the answers to the following queries: 1) How does the combined feature set (i.e., F1 + F2 + F3 + derived features, which can also be represented by F1′ ∪ F2′ ∪ F3′) help in the detection of malicious PDFs? 2) Will the best feature subset generated from the combined feature set, taking into account the feature importance and the approach described in Case III, differ from the final feature set acquired in Case III? 3) Will the best feature subset produced in this scenario have a positive or negative impact on classification performance? And lastly, 4) What is the classification performance when no derived features are used, i.e., only F1 + F2 + F3?
• Case V: In this instance, we focus on explaining: How does the freshly developed final feature set aid the classifier in detecting PDF malware? Furthermore, we present an analysis of the distribution of the characteristics of the final feature set in the operational dataset to identify a few prospective directions that may effectively aid in identifying PDF malware.

In addition, we utilize the strength of the classifier to discover a few key decision rules that humans can understand easily and apply to identify potentially dangerous PDF content.

A. EVALUATION METRICS
We employ the abbreviations for the evaluation metrics listed below to analyze the classification report:
• Acc: The term accuracy is abbreviated as Acc, and it can be assessed using the following formula: Accuracy = (TP + TN) / (TP + FN + TN + FP), where TP means True Positive, FP means False Positive, TN means True Negative, and FN stands for False Negative.
• Pr: The term precision is abbreviated as Pr and can be measured by Precision = TP / (TP + FP).
• Rec: The abbreviation Rec is used in place of the term Recall, which can be quantified by Recall = TP / (TP + FN).
• F1: The term F1-Score is denoted as F1, which can be calculated by F1 = 2 * (Pr * Rec) / (Pr + Rec).

B. CASE I
In this case, we look at the impact of the standard feature sets F1, F2, and F3, as well as baseline machine learning classifiers, to determine the top-performing model for detecting PDF malware. Table 3 demonstrates the performance of the various baseline machine learning classifiers along with a deep neural network (DNN) for PDF malware detection utilizing the standard feature sets F1, F2, and F3 based on 10-fold cross-validation. Conspicuously, we can observe that the Random Forest classifier yields the best accuracy for all the standard feature sets compared to the baseline classifiers while identifying PDF malware. We utilized Scikit-Learn, a well-known open-source machine-learning library for Python, to implement the baseline classifiers. We constructed the Random Forest classifier with 100 estimators and with random_state = 42 to handle the randomness. On the other hand, we built the C5.0, SVM, J48, AdaBoost, and KNN classifiers with their default hyperparameters as per the Scikit-Learn library.
Furthermore, we developed the DNN model, which is an MLPClassifier, with a hidden layer of 100 units and random_state = 42. We executed the DNN model for 100 epochs. And finally, the Gradient Boosting classifier (GBC) was introduced with 100 estimators and random_state = 42. While implementing the standard feature
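Under the stated setup (Scikit-Learn, 100 estimators, random_state = 42, 10-fold cross-validation), the Case I protocol and the metric formulas above can be sketched as follows; the feature matrix here is a synthetic stand-in for the extracted PDF feature sets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one of the standard feature sets.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Random Forest as described: 100 estimators, random_state = 42,
# scored with 10-fold cross-validated accuracy.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
fold_acc = cross_val_score(rf, X, y, cv=10, scoring="accuracy")
mean_acc = fold_acc.mean()

def report(tp, fp, tn, fn):
    """Acc/Pr/Rec/F1 from confusion counts, per the formulas above."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    pr = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pr * rec / (pr + rec)
    return acc, pr, rec, f1
```

Swapping in the other baseline classifiers only changes the estimator passed to `cross_val_score`; the evaluation protocol stays the same.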
C. CASE II
In this case, we discover the impact of the derived feature sets on the classifier's performance as well as how much the derived features contribute to identifying PDF malware. Table 5 highlights the performance of the Random Forest classifier utilizing both the standard and derived feature sets based on 10-fold cross-validation. The findings explicitly demonstrate that the derived feature sets noticeably improved the effectiveness of the classifier compared to the standard feature sets. We observe a nearly 2% increase in accuracy for the classifier when utilizing the derived feature set F1′ instead of the standard feature set F1. Similarly, the other derived feature sets F2′ and F3′ maintain the same consistency in accuracy improvement as F1′. However, for the derived feature set F3′, we obtain the best classification report, with the classifier attaining an accuracy of 98.90% for the detection of PDF malware.

TABLE 5. Improvement of random forest classifier's performance using derived feature sets compared to the standard feature sets for detecting PDF malware based on 10-fold cross-validation.

To further assess the significance of the derived features, we estimate the feature importance of the derived feature sets F1′, F2′, and F3′ leveraging the power of the classifier. Table 6 shows the feature importance of the derived feature sets and explicitly exhibits the significance of the derived features and how much they contribute during the classification. The results reveal that the derived feature Headerlength contributes most for all the derived feature sets, i.e., the feature Headerlength contributes 28.19%, 34.15%, and 30.85% when F1′, F2′, and F3′ are utilized respectively for the classification activities. Likewise, another important derived feature, Malicecontent, turns out to be in the top three in terms of significance among all the features within the derived feature sets F1′ and F2′, exhibiting 12.6% and 17.10% contributions respectively. However, we observe the Malicecontent feature as the second most important feature, yielding 9.7% significance, when the classifier utilizes F3′ for the identification of PDF malware. Apart from the Headerlength and Malicecontent derived features, we encounter only minor contributions from the other derived features, though we find the small content feature assisting the classifier a little among the rest.

FIGURE 6. ROC curve comparison of random forest model with various classifiers adopted in this study on the standard feature set F1.
FIGURE 7. ROC curve comparison of random forest model with various classifiers adopted in this study on the standard feature set F2.
TABLE 3. An investigation of the Accuracy, Precision, Recall, and F1-Score of various machine learning methods for standard feature sets utilizing tools (PDFiD, PDFINFO, and PDF-PARSER) based on 10-fold cross-validation.

D. CASE III
In this instance, we identify the features from the derived feature sets F1′, F2′, and F3′ that are selected for the final feature set through careful observations. Besides, we investigate the effect of introducing the final feature set on the classifier's performance. Furthermore, we estimate the significance of the final feature set in the identification of PDF malware.

To identify the best feature subsets from the derived feature sets, we follow the steps mentioned in Algorithm 1. We use the findings highlighted in Table 6, where the feature importance of each derived feature set is estimated and then sorted according to importance. We consider the features with at least 1% of the feature importance score to generate subsets from the derived feature sets F1′, F2′, and F3′. Thus, we find the top 15 features from F1′, the top 10 features from F2′, and the top 18 features from F3′ that satisfy the aforementioned condition, and these features are initially selected for the generation of feature subsets. We generate the subsets starting from the first feature and gradually increase the features sequentially up to the last feature considered for feature subset generation from each derived feature set.

Therefore, we build 15 subsets from F1′, 10 subsets from F2′, and 18 subsets from F3′ respectively. Further, we follow the steps mentioned in Algorithm 1 to estimate the effectiveness of each feature subset in order to select the best feature subset from each derived feature set for final feature set generation. We highlight the mean accuracy obtained utilizing the aforementioned approach for each feature subset of F1′, F2′, and F3′ in Fig. 9, Fig. 10, and Fig. 11 respectively. According to the depiction in Fig. 9, the classifier achieves a maximum mean accuracy of 98.69% when implementing the subset consisting of the 11 top features from F1′, and there is a small variation in the mean accuracy for other subsets of F1′. Likewise, in Fig. 10, the classifier yields a maximum mean accuracy of 97.91% when applying the subset consisting of the 8 top features from F2′, and there is a considerable variation in the mean accuracy for other subsets of F2′. However, we notice in Fig. 11 that the classifier achieves the highest mean accuracy when utilizing all of the top 18 features considered for subset generation from F3′. Thus, the subset comprising the top 11 features of F1′ is identified as the best feature subset from F1′, the top 8 features of F2′ as the best feature subset from F2′, and the top 18 features of F3′ as the best feature subset from F3′. To accommodate the commonness as well as the uncommonness among the newly identified best feature subsets, we perform a union operation to generate the final feature set. Table 7 represents the list of the identified features that are finally considered for the final feature set. We discover three derived features, namely Headerlength, Malicecontent, and small content, in the final feature set. Since the JavaScript feature is found in all three derived feature sets F1′, F2′, and F3′, we only consider this feature once (from F1′) in the list. The features
13846 VOLUME 12, 2024
G. M. S. Hossain et al.: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis
/JS, startxref, and xref are noticed in both F1′ and F3′, so we take them only once (from F1′) in the list of the final feature set. We observe the features obj, endobj, stream, /OpenAction, and /XFA only from F1′ in the final feature set. Besides, the features Filesize_kb, MetadataStream, Optimized, and Pages are encountered only from F2′, and the rest of the features listed in Table 7 are observed only from F3′.

We investigate the impact of the final feature set on the Random Forest classifier based on 10-fold cross-validation to detect PDF malware. Table 8 shows the findings of the classifier for the several types of feature sets used in this research. We notice an impressive increase in the accuracy of the classifier due to the utilization of the final feature set compared to the standard and derived feature sets. The maximum accuracy improvement for the classifier when employing the final feature set is 2.71% compared to the standard feature set F2. On the contrary, we find that the minimum accuracy improvement of the classifier when executing the final feature set is 0.34% compared to the derived feature set F3′. However, the classifier provides a noticeable performance boost in the case of PDF malware detection due to the introduction of the freshly developed final feature set.

The loss of the Random Forest model for the various folds of the 10-fold cross-validation is depicted in Fig. 13. We observe the maximum loss during the third fold, whereas the minimum loss occurs during the first fold of the entire 10-fold cross-validation. Fig. 14 represents the Receiver Operating Characteristic (ROC) curve of the Random Forest model for the various folds of the 10-fold cross-validation on the final feature set. We notice an area under the curve of 1.00 for the Random Forest model throughout the entire 10-fold cross-validation.

FIGURE 11. Mean accuracy of random forest classifier vs top feature subset of the derived feature set F3′.
FIGURE 12. Accuracy curve of random forest model on final feature set based on 10-fold cross-validation.
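The subset-growing procedure of Algorithm 1 (rank features by importance, keep those above 1%, grow top-k subsets, and retain the best-scoring one) can be sketched as below. The data and feature names are synthetic placeholders, not the paper's derived feature sets; the final feature set would be the union of the winners over F1′, F2′, and F3′.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def best_subset(X, y, names, min_importance=0.01):
    """Rank features by Random Forest importance, grow top-k subsets,
    and keep the subset with the best 10-fold mean accuracy."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
    ranked = [i for i in np.argsort(rf.feature_importances_)[::-1]
              if rf.feature_importances_[i] >= min_importance]  # >= 1% score
    best_idx, best_score = [], -1.0
    for k in range(1, len(ranked) + 1):
        idx = ranked[:k]                       # subset of the top-k features
        score = cross_val_score(
            RandomForestClassifier(n_estimators=100, random_state=42),
            X[:, idx], y, cv=10).mean()
        if score > best_score:                 # keep the best subset so far
            best_idx, best_score = idx, score
    return {names[i] for i in best_idx}

# Synthetic stand-in for one derived feature set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
names = [f"f{i}" for i in range(8)]
final_feature_set = best_subset(X, y, names)
```

A union such as `best_subset(...) | best_subset(...)` over the per-set winners then mirrors step 13 of Algorithm 1.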
TABLE 6. Feature importance of derived feature sets F1′, F2′, and F3′.
TABLE 7. List of identified features of the final feature set.
TABLE 9. Impact of combined feature set as well as the feature set with no derived features (i.e. F1 + F2 + F3) on the random forest classifier's performance based on 10-fold cross-validation for PDF malware detection.
TABLE 10. Feature importance of top features of combined feature set.

using a popular machine learning software, Weka 3 [47].
We find the following features in the best subset as output
from the method: Headerlength, contentcorrupt, /Encrypt,
Malicecontent, /Colors, Metadata Stream, Optimized, Page
size:_A4, Page size:_miscsize, /Size, and /Action. Similarly,
we also implement the ReliefFAttributeEval feature selection
method (with Ranker approach) to produce the best subset
from the combined feature set using Weka 3. We consider
the features having at least a 1% merit score and then
adopt the approach as mentioned in CASE III to develop
and evaluate the feature subset. Finally, we identify the best
subset from this approach having the following features:
Headerlength, Optimized, Malicecontent, Metadata Stream,
Tagged, /EmbeddedFile, Custom Metadata, Form:_none,
/FontDescriptor, /XFA, /Font, small content, /Producer,
/AcroForm, /ModDate, %EOF, Form:_XFA, /XML, /Action,
/CreationDate, and xref. To further evaluate the potency of the final feature set, we evaluate these feature subsets using the classifier and compare the results. We find that for
the subset of CfsSubsetEval, the classifier yields an accuracy
of 97.59% whereas for the subset of ReliefFAttributeEval, the
classifier provides 98.75% accuracy. Notably, we identify that
the classifier produces the highest accuracy when using the
final feature set to detect malicious PDFs.
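The comparison step above — scoring each externally selected subset with the same classifier and protocol — can be sketched as follows. The feature names and subsets here are hypothetical placeholders, not the actual CfsSubsetEval/ReliefFAttributeEval outputs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the combined feature matrix.
X, y = make_classification(n_samples=400, n_features=12, random_state=42)
names = [f"f{i}" for i in range(12)]

# Placeholder subsets standing in for the Weka-selected feature lists.
subsets = {"cfs": ["f0", "f3", "f5"], "relieff": ["f1", "f2", "f4", "f7"]}

scores = {}
for label, feats in subsets.items():
    idx = [names.index(f) for f in feats]      # map names to column indices
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores[label] = cross_val_score(clf, X[:, idx], y, cv=10).mean()

best = max(scores, key=scores.get)             # subset with highest mean accuracy
```

Because every candidate subset is scored with the same estimator and the same 10-fold split protocol, the accuracies are directly comparable.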
Overall, we find that from CASE I the classifier delivers the highest accuracy of 97.19% on the standard feature set F3; from CASE II the classifier yields the highest accuracy of 98.90% on the derived feature set F3′; from CASE III the classifier produces the best accuracy of 99.24% utilizing the final feature set; and from CASE IV the classifier outputs the highest accuracy of 99.19% on the combined feature set with derived features. Thus, conspicuously, we can identify that the final feature set assists the classifier in delivering the highest efficacy for detecting PDF malware among all the feature sets.

FIGURE 15. Mean accuracy of random forest classifier vs top feature subset of the combined feature set.

F. CASE V
In this case, we provide an explanation of how the freshly created final feature set contributes to the classifier for identifying maliciousness in PDF. To analyze the significance of the final feature set, we estimate the importance of the top features from this feature set utilizing the Random Forest classifier, which is illustrated in Fig. 16. This illustration uncovers the important features and how much they contribute to the classification activities. The illustration reveals that the derived features Headerlength and Malicecontent contribute largely to the identification of PDF malware. On the other hand, the /JS and /Javascript features are also proven to be very crucial for malicious PDF detection.

We observe all the features from the newly developed final feature set to explore the traits of both categories of PDFs toward these features in our operational dataset. The average value of the derived feature Headerlength is 16.50 with a standard deviation of 12.45 for the benign PDFs, whereas for the malicious ones, the average value is 42.48 with a standard deviation of 7.28, as depicted in Fig. 17. The illustration also reveals that the title length of 75.83% of the benign PDFs is less than or equal to the mean value of the benign ones, while on the other hand, 89.16% of the malicious PDFs have a title length less than or equal to their mean value. However, this explains the fact that in our operational dataset, the average title length of the benign PDFs is much smaller than that of the malicious ones. This finding provides a potential indication for identifying PDF malware by just looking at the length of the title of the PDF, though this feature alone does not necessarily point to maliciousness within a PDF, because in a real-world scenario a clean PDF often may have a large title length.

The distribution of another derived feature, Malicecontent, which is constructed through inspecting the triggering features across the malicious and benign PDFs, is illustrated in Fig. 18.

FIGURE 16. A bar plot to display the importance of the top features of the final feature set.
FIGURE 22. A decision tree from one of the estimators of the Random Forest classifier used to detect PDF malware utilizing the final feature set.
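A tree of the kind illustrated here can be dumped as human-readable if/then rules from one estimator of a fitted Random Forest. The sketch below uses synthetic data and illustrative feature names, not the paper's exact final feature set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in for the final feature matrix; the names are
# illustrative placeholders borrowed from the feature vocabulary.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Dump the first estimator as an indented rule listing.
rules = export_text(rf.estimators_[0],
                    feature_names=["/Javascript", "Headerlength",
                                   "xref", "Filesize_kb"])
```

Each root-to-leaf path corresponds to one decision rule; passing `show_weights=True` to `export_text` additionally exposes the leaf sample weights, from which rule coverage and confidence can be estimated.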
and harmful PDFs. As we look into Table 11, we find that the first decision rule signifies that if the PDF sample does not contain any /Javascript and its title length is less than or equal to 39.5, then the PDF sample falls into the benign category. This rule accurately identifies 7154 PDFs from our operational dataset, suggesting that the rule has 100% confidence. Similar to the first decision rule, all the other rules of the table can be explained and interpreted by humans to clearly detect benign PDFs. Moreover, we find several strong rules (such as Rule ID 2 to 8) that yield 99% to 100% confidence as well as cover a wide range of samples for identifying benign PDFs. On the contrary, we find comparatively less effective rules at the bottom of Table 11 (such as Rule ID 9 to 15), where the rules yield a confidence level of around 90% to 98% and cover only a small number of samples to clearly identify them as clean files.

TABLE 11. Top decision rules extracted from random forest classifier to detect clean PDF.

Looking at Table 12, we see that in the case of harmful PDF detection, a far larger number of requirements must be verified than in the case of benign PDF detection. Despite the large number of constraints, we find a few strong rules (such as Rule ID 1 to 5) that offer nearly 100% confidence in recognizing thousands of samples from our operational dataset as malicious. On the contrary, we identify somewhat less effective decision rules that only apply to a very tiny number of samples from our dataset as we carefully examine rules 6 to 10 in the table. Similar to the explanation given for Table 11, the decision rules of Table 12 can be extensively described and interpreted. For instance, the first rule of Table 12 states that a PDF sample is considered malicious if it lacks the /XML feature, is not optimized, has a file size greater than 1.41 kilobytes and less than or equal to 46.52 kilobytes, a title length greater than 39 and less than or equal to 63, contains a cross-reference table less than or equal to 1.5 times, and does not have the /CreationDate feature. The rule correctly identifies 5079 malicious PDF samples from the dataset.

FIGURE 25. Number of decision rules derived from decision trees of the random forest classifier.

Similarly, the rest of the decision rules of Table 12 can be explained and interpreted for recognizing malicious PDFs. Nevertheless, these crucial decision rules mentioned in Table 11 and Table 12 can significantly contribute to a clear understanding by humans for categorizing benign and malicious PDFs.

Finally, to assess this study of PDF malware detection, we perform a comparison with various existing works in the same study discipline. Table 13 summarizes this comparative study, in which our work is evaluated from a variety of perspectives, including the PDF sample source, the number of samples considered for the study, PDF labels, PDF analysis tools, the total number of PDF features considered for the study, the number of derived features developed for the analysis, the machine learning model used in the study, the accuracy observed during the study, and whether the study provides decision rules and human interpretation.

The authors of [1] described a method that used machine learning classifiers to evaluate a given PDF both statistically and interactively to identify the hazardous nature of the document. They ran their trials on 1200 PDF samples with PhoneyPDF (a PDF analysis tool) and discovered that the Random Forest classifier was the best fit to detect malicious PDFs, with an accuracy of 98.6%. Similarly, the authors in [33] implemented the Random Forest classifier to identify malicious PDFs. They only used 1000 PDF samples in their pilot investigation and employed the PDFiD and PeePDF PDF analysis tools to extract the potential features that were critical for the classification task. However, their suggested strategy provided a maximum accuracy of 97.4% in completing the task. Besides, the authors in [4] proposed an approach called O-DT (Optimizable Decision Tree) for PDF malware detection. The authors utilized a benchmark dataset to perform their intended experiments and got an accuracy of 98.84% for the suggested approach. The study in [46] presented a dataset consisting of 10,025 PDF samples based on evasive characteristics of PDF files. Moreover, the authors suggested an ensemble classifier based on stacking learning which provided 98.69% accuracy to detect PDF malware. However, we notice that our work outperforms the existing works presented in Table 13, covering a large number of PDF samples with the use of three advanced tools for feature extraction, deriving important features, and utilizing the power of the Random Forest classifier to achieve a much better accuracy of 99.24% for detecting PDF malware. Furthermore, to the best of our knowledge, none of the research presented in Table 13 gives a thorough human interpretation of the classifier's performance by illustrating a decision tree and identifying decision rules for PDF malware detection.

V. DISCUSSION
We created a dataset considering the malicious, clean, and evasive PDFs to detect PDF malware. However, we take only 792 evasive PDFs, which is approximately 5% of the entire dataset. The intuition of introducing the evasive PDFs is to reduce the bias of the classifier. Moreover, the evasive characteristics of the PDFs make the classifier more robust in the detection of PDF malware. We maintained an approximately balanced distribution between the benign and malicious PDFs to overcome the problem of skewness of the classifier toward a particular class. To analyze the PDFs and extract the useful features, we used three well-known and highly accurate tools: PDFiD, PDFINFO, and PDF-PARSER. The idea of using these tools is to develop an effective feature set for PDF malware detection by exploring multiple efficient tools that ensure the acceptability of the extracted standard features of the PDFs. Additionally, we derived a few important features and merged them with the standard features to generate a merged feature set. We built the final feature set by generating subsets from the merged feature set and assessed them utilizing the strength of the classifier.

We performed an in-depth experimental analysis of the final feature set to explain the traits of the feature set. We found the title length of the PDF files is crucial, as malicious files tend to have an unusual title length compared to clean files. Furthermore, metadata and structural features such as /Producer, /ProcSet, /ID, /CreationDate, etc. are frequently observed inside the clean PDFs and seldom found within the harmful ones. Attackers usually keep fraudulent files as small as possible by restricting the contents, pages, fonts, and size with the intuition of carrying out their attacks as swiftly as possible. We discovered that
cyber criminal’s primary intention is to insert malice-related the strength of the classifier can be enhanced to combat
contents such as inserting JavaScript code, OpenAction files, modern advanced attacks more precisely if we can include
etc. within the structure of the PDFs to harm the victim’s more evasive PDF properties through careful inspection.
systems. We extracted characteristics in this experiment using three
We provided an explicit interpretation and explanation of tools: PDFiD, PDFINFO, and PDF-PARSER. These tools
the classifier’s performance by generating a decision tree while very popular, are known to have vulnerabilities (for
from one of the classifier’s estimators, as well as highlighting instance, PDFiD tool) to some attacks. One such attack is
a few critical decision rules for recognizing malicious and known as the parser confusion attack where the fraudulent
clean PDFs. We discovered some strong decision rules for material is disguised and concealed using a variety of
recognizing both types of PDFs that provide up to 100% approaches to avoid detection while retaining the ability
confidence and can identify a large number of samples; to execute and exploit. Also, run-time and other dynamic
nevertheless, we noticed a number of rules that require a characteristics may be leveraged to further investigate
significant number of constraints to be verified, making them questionable documents. We intend to address each of these
somewhat less effective. In addition to that, these weak rules constraints in our future work. Additional analysis can be
can accurately identify a small number of samples yet yield performed by combining aspects from various parsers and
high confidence. However, the decision rules offer a clear analysis techniques to investigate complicated content such
understanding and interpretation of how the features can be as JavaScript code.
utilized to detect PDF malware. Furthermore, the present feature set is derived from three
In this study, we added evasive behaviors to our experimen- extraction methods, and the features employed by the three
tal dataset to make our classifier more resilient. Nevertheless, programs depend on heuristics and insights made by their
TABLE 12. Top decision rules extracted from random forest classifier to detect malicious PDF.
TABLE 13. Comparison of our work with various existing studies for PDF malware detection.
developers. A more comprehensive feature set can be built by incorporating new sources, such as malicious document generation tools or in-depth study of malicious PDF documents. One such analysis can be to consider the internal text of malicious PDFs, where the attackers can hide their harmful code segment behind the text content. Also, we want to assess the generalizability of our suggested method against multiple types of PDF malware by investigating how the model performs against different types of PDF malware, including newer or more advanced variations. Plus, we want to implement the proposed method in real-world scenarios or simulations to justify its practical effectiveness in identifying various types of PDF malware. Besides, we intend to investigate certificateless signcryption [48] and proxy signcryption [49] as advanced strategies for safeguarding PDFs, which can add additional layers of security that could potentially mitigate the risks posed by PDF malware and lead the way for future research that integrates cryptographic techniques with malware detection. In addition, adversarial PDF malware still poses a great threat to a secure cyberspace. In the future, to combat such threats, we want to develop a data-driven intelligent approach that can tackle adversarial PDF malware effectively. Besides, we want to publish an additional dataset comprised solely of evasive PDF samples covering a wide range of approaches to cyber attacks.

VI. CONCLUSION
In this study, we performed an extensive analysis for PDF malware detection. For this, we first developed a comprehensive dataset of 15958 PDF samples by taking into account the non-malicious, malicious, and evasive natures of the PDF samples. We also developed a method to generate an effective and explainable feature set by extracting important traits from our freshly constructed dataset's PDF samples using multiple PDF analysis tools. Further, we also derived features that are empirically demonstrated to be useful for classifying PDF malware. We investigated different machine learning classifiers and highlighted the effectiveness of the Random Forest model not only for performance comparison but also for the explainability analysis by generating decision rules. Moreover, we clarified the behaviors of the characteristics in charge of detecting PDF malware and pointed out a few relevant observations that may aid in the detection of hazardous PDF files. Finally, we compared our findings to several state-of-the-art research works and highlighted some key observations of our study.

REFERENCES
[1] S. S. Alshamrani, "Design and analysis of machine learning based technique for malware identification and classification of portable document format files," Secur. Commun. Netw., vol. 2022, pp. 1-10, Sep. 2022.
[2] P. Singh, S. Tapaswi, and S. Gupta, "Malware detection in PDF and office documents: A survey," Inf. Secur. J., Global Perspective, vol. 29, no. 3, pp. 134-153, May 2020.
[3] N. Livathinos, C. Berrospi, M. Lysak, V. Kuropiatnyk, A. Nassar, A. Carvalho, M. Dolfi, C. Auer, K. Dinkla, and P. Staar, "Robust PDF document conversion using recurrent neural networks," in Proc. AAAI Conf. Artif. Intell., vol. 35, no. 17, 2021, pp. 15137-15145.
[4] Q. A. Al-Haija, A. Odeh, and H. Qattous, "PDF malware detection based on optimizable decision trees," Electronics, vol. 11, no. 19, p. 3142, Sep. 2022.
[5] Y. Wiseman, "Efficient embedded images in portable document format," Int. J., vol. 124, pp. 38-129, Jan. 2019.
[13] S. Atkinson, G. Carr, C. Shaw, and S. Zargari, "Drone forensics: The impact and challenges," in Digital Forensic Investigation of Internet of Things (IoT) Devices. Cham, Switzerland: Springer, 2021, pp. 65-124.
[14] C. Liu, C. Lou, M. Yu, S. M. Yiu, K. P. Chow, G. Li, J. Jiang, and W. Huang, "A novel adversarial example detection method for malicious PDFs using multiple mutated classifiers," Forensic Sci. Int., Digit. Invest., vol. 38, Oct. 2021, Art. no. 301124.
[15] Q. A. Al-Haija and A. Ishtaiwi, "Machine learning based model to identify firewall decisions to improve cyber-defense," Int. J. Adv. Sci., Eng. Inf. Technol., vol. 11, no. 4, p. 1688, Aug. 2021.
[16] D. Stevens. (2023). PDFid (Version 0.2.8). [Online]. Available: https://blog.didierstevens.com/programs/pdf-tools
[17] PDF-Info. (2021). PDF-Info (Version 2.1.0). [Online]. Available: https://pypi.org/project/pdf-info/
[18] D. Stevens. (2023). PDF-Parser (Version 0.7.8). [Online]. Available: https://blog.didierstevens.com/programs/pdf-tools
[19] M. Yu, J. Jiang, G. Li, C. Lou, Y. Liu, C. Liu, and W. Huang, "Malicious documents detection for business process management based on multi-layer abstract model," Future Gener. Comput. Syst., vol. 99, pp. 517-526, Oct. 2019.
[20] H. Pareek, P. Eswari, N. S. C. Babu, and C. Bangalore, "Entropy and n-gram analysis of malicious PDF documents," Int. J. Eng., vol. 2, no. 2, pp. 1-3, 2013.
[21] C. Smutz and A. Stavrou, "Malicious PDF detection using metadata and structural features," in Proc. 28th Annu. Comput. Secur. Appl. Conf., Dec. 2012, pp. 239-248.
[22] D. Maiorca, G. Giacinto, and I. Corona, "A pattern recognition system for malicious PDF files detection," in Proc. Int. Workshop Mach. Learn. Data Mining Pattern Recognit. Cham, Switzerland: Springer, 2012, pp. 510-524.
[23] H. Pareek, "Malicious PDF document detection based on feature extraction and entropy," Int. J. Secur., Privacy Trust Manage., vol. 2, no. 5, pp. 31-35, Oct. 2013.
[24] D. Maiorca, D. Ariu, I. Corona, and G. Giacinto, "A structural and content-based approach for a precise and robust detection of malicious PDF files," in Proc. Int. Conf. Inf. Syst. Secur. Privacy (ICISSP), Feb. 2015, pp. 27-36.
[25] N. Šrndić and P. Laskov, "Hidost: A static machine-learning-based detector of malicious files," EURASIP J. Inf. Secur., vol. 2016, no. 1, pp. 1-20, Dec. 2016.
[26] P. Laskov and N. Šrndić, "Static detection of malicious JavaScript-bearing PDF documents," in Proc. 27th Annu. Comput. Secur. Appl. Conf., Dec. 2011, pp. 373-382.
[27] Z. Tzermias, G. Sykiotakis, M. Polychronakis, and E. P. Markatos, "Combining static and dynamic analysis for the detection of malicious documents," in Proc. 4th Eur. Workshop Syst. Secur., Apr. 2011, pp. 1-6.
[28] C. Vatamanu, D. Gavrilut, and R. Benchea, "A practical approach on clustering malicious PDF documents," J. Comput. Virol., vol. 8, no. 4, pp. 151-163, Nov. 2012.
[29] F. Schmitt, J. Gassen, and E. Gerhards-Padilla, "PDF scrutinizer: Detecting JavaScript-based attacks in PDF documents," in Proc. 10th Annu. Int. Conf. Privacy, Secur. Trust, Jul. 2012, pp. 104-111.
[6] M. Ijaz, M. H. Durad, and M. Ismail, ‘‘Static and dynamic malware analysis [30] S. Karademir, T. Dean, and S. Leblanc, ‘‘Using clone detection to find
using machine learning,’’ in Proc. 16th Int. Bhurban Conf. Appl. Sci. malware in acrobat files,’’ in Proc. Conf. Center Adv. Stud. Collaborative
Technol. (IBCAST), Jan. 2019, pp. 687–691. Res., 2013, pp. 70–80.
[7] Y. Alosefer, ‘‘Analysing web-based malware behaviour through client [31] X. Lu, J. Zhuge, R. Wang, Y. Cao, and Y. Chen, ‘‘De-obfuscation and
honeypots,’’ Ph.D. dissertation, School Comput. Sci. Inform., Cardiff detection of malicious PDF files with high accuracy,’’ in Proc. 46th Hawaii
Univ., Cardiff, Wales, U.K., 2012. Int. Conf. Syst. Sci., Jan. 2013, pp. 4890–4899.
[8] N. Idika and A. P. Mathur, ‘‘A survey of malware detection techniques,’’ [32] I. Corona, D. Maiorca, D. Ariu, and G. Giacinto, ‘‘Lux0R: Detection of
Purdue Univ., vol. 48, no. 2, pp. 32–46, 2007. malicious PDF-embedded Javascript code through discriminant analysis
[9] M. Abdelsalam, M. Gupta, and S. Mittal, ‘‘Artificial intelligence assisted of API references,’’ in Proc. Workshop Artif. Intell. Secur. Workshop,
malware analysis,’’ in Proc. ACM Workshop Secure Trustworthy Cyber- Nov. 2014, pp. 47–57.
Phys. Syst., Apr. 2021, pp. 75–77. [33] A. Falah, L. Pan, S. Huda, S. R. Pokhrel, and A. Anwar, ‘‘Improving mali-
[10] W. Wang, Y. Shang, Y. He, Y. Li, and J. Liu, ‘‘BotMark: Automated cious PDF classifier with feature engineering: A data-driven approach,’’
botnet detection with hybrid analysis of flow-based and graph-based traffic Future Gener. Comput. Syst., vol. 115, pp. 314–326, Feb. 2021.
behaviors,’’ Inf. Sci., vol. 511, pp. 284–296, Feb. 2020. [34] Virustotal. Accessed: Jun. 18, 2023. [Online]. Available: https://
[11] N. Srndic and P. Laskov, ‘‘Practical evasion of a learning-based classifier: www.virustotal.com/gui/home/upload
A case study,’’ in Proc. IEEE Symp. Secur. Privacy, May 2014, [35] A. Kang, Y.-S. Jeong, S. Kim, and J. Woo, ‘‘Malicious PDF detection
pp. 197–211. model against adversarial attack built from benign PDF containing
[12] D. Maiorca, I. Corona, and G. Giacinto, ‘‘Looking at the bag is not enough Javascript,’’ Appl. Sci., vol. 9, no. 22, p. 4764, Nov. 2019.
to find the bomb: An evasion of structural methods for malicious PDF [36] D. Maiorca and B. Biggio, ‘‘Digital investigation of PDF files: Unveiling
files detection,’’ in Proc. 8th ACM SIGSAC Symp. Inf., Comput. Commun. traces of embedded malware,’’ IEEE Secur. Privacy, vol. 17, no. 1,
Secur., May 2013, pp. 119–130. pp. 63–71, Jan. 2019.
[37] N. Nissim, A. Cohen, C. Glezer, and Y. Elovici, ‘‘Detection of malicious KAUSHIK DEB received the B.Tech. and M.Tech.
PDF files and directions for enhancements: A state-of–the art survey,’’ degrees from the Department of Computer Science
Comput. Secur., vol. 48, pp. 246–266, Feb. 2015. and Engineering, Tula State University, Tula,
[38] M. Xu and T. Kim, ‘‘$PlatPal$: Detecting malicious documents with Russia, in 1999 and 2000, respectively, and
platform diversity,’’ in Proc. 26th USENIX Secur. Symp. (USENIX Secur.), the Ph.D. degree in electrical engineering and
2017, pp. 271–287. information systems from the University of Ulsan,
[39] Y. Chen, S. Wang, D. She, and S. Jana, ‘‘On training robust $PDF$ Ulsan, South Korea, in 2011. Since 2001, he has
malware classifiers,’’ in Proc. 29th USENIX Secur. Symp. (USENIX Secur.), been a Faculty Member of the Department of
2020, pp. 2343–2360.
Computer Science and Engineering (CSE), Chit-
[40] C. Smutz and A. Stavrou, ‘‘When a tree falls: Using diversity in ensemble
tagong University of Engineering and Technology
classifiers to identify evasion in malware detectors,’’ in Proc. Netw. Distrib.
Syst. Secur. Symp., 2016, pp. 1–15. (CUET), Chattogram, Bangladesh, where he is currently a Professor with the
[41] M. Li, Y. Liu, M. Yu, G. Li, Y. Wang, and C. Liu, ‘‘FEPDF: A Department of CSE. Moreover, he was in various administrative positions
robust feature extractor for malicious PDF detection,’’ in Proc. IEEE with CUET, such as the Dean of the Faculty of Electrical and Computer
Trustcom/BigDataSE/ICESS, Aug. 2017, pp. 218–224. Engineering (ECE), from 2017 to 2019, the Director of the Institute of
[42] D. Liu, H. Wang, and A. Stavrou, ‘‘Detecting malicious Javascript in PDF Information and Communication Technology (IICT), from 2015 to 2017,
through document instrumentation,’’ in Proc. 44th Annu. IEEE/IFIP Int. and the Head of the CSE Department, from 2012 to 2015. He made a
Conf. Dependable Syst. Netw., Jun. 2014, pp. 100–111. variety of contributions to managing and organizing conferences, workshops,
[43] N. Šrndic and P. Laskov, ‘‘Detection of malicious pdf files based on and other academic gatherings. He has published more than 110 technical
hierarchical document structure,’’ in Proc. 20th Annu. Netw. & Distrib. articles with peer reviews. His research interests include computer vision,
Syst. Secur. Symp., 2013, pp. 1–16. deep learning, pattern recognition, intelligent transportation systems (ITSs),
[44] Canadian Institute for Cybersecurity (CIC). (2022). PDF dataset: and human–computer interaction. He was a Steering Member. He acted
CIC-Evasive-PDFMAL2022. [Online]. Available: https://www.unb. as the Chair or a Secretary in a variety of international and national
ca/cic/datasets/pdfmal-2022.html conferences, such as the International Conference on Electrical, Computer,
[45] (2013). Contaigo, 16,800 Clean and 11,960 Malicious Files for Sig- and Communication Engineering (ECCE), the International Forum on
nature Testing and Research. [Online]. Available: http://contagiodump. Strategic Technology (IFOST), the International Workshops on Human
blogspot.com/2013/03/16800-clean-and-11960-malicious-files.html
System Interactions (HSI), and the National Conference on Intelligent
[46] M. Issakhani, P. Victor, A. Tekeoglu, and A. Lashkari, ‘‘PDF malware
Computing and Information Technology (NCICIT).
detection based on stacking learning,’’ in Proc. 8th Int. Conf. Inf. Syst.
Secur. Privacy, 2022, pp. 562–570.
[47] E. Frank, M. A. Hall, and I. H. Witten, ‘‘Data mining: Practical
machine learning tools and techniques,’’ in The WEKA Workbench,
4th ed. San Mateo, CA, USA: Morgan Kaufmann, 2016. [Online].
HELGE JANICKE received the Ph.D. degree from
Available:http://www.cs.waikato.ac.nz/ml/weka/book.html De Montfort University (DMU), U.K., in 2007.
[48] I. Ullah, N. Ul Amin, M. Zareei, A. Zeb, H. Khattak, A. Khan, He is currently a Professor in cybersecurity with
and S. Goudarzi, ‘‘A lightweight and provable secured certificate- Edith Cowan University (ECU), Australia. He is
less signcryption approach for crowdsourced IIoT applications,’’ Sym- also the Director of the Security Research Institute,
metry, vol. 11, no. 11, p. 1386, Nov. 2019. [Online]. Available: ECU, and the Research Director for Australia’s
https://www.mdpi.com/2073-8994/11/11/1386 Cyber Security Cooperative Research Centre.
[49] A. Waheed, A. I. Umar, M. Zareei, N. Din, N. U. Amin, J. Iqbal, He established DMU’s Cyber Technology Insti-
Y. Saeed, and E. M. Mohamed, ‘‘Cryptanalysis and improvement of a tute, DMU, and its Airbus Centre of Excellence
proxy signcryption scheme in the standard computational model,’’ IEEE for SCADA cybersecurity and digital forensics
Access, vol. 8, pp. 131188–131201, 2020. research, and heading up DMU’s School of Computer Science. His research
interests include cybersecurity in critical infrastructure, human factors of
cybersecurity, the cybersecurity of emerging technologies, digital twins, and
the Industrial IoT.