Knut Borg
Master’s Thesis
Master of Science in Information Security
30 ECTS
Department of Computer Science and Media Technology
Gjøvik University College, 2013
Department of Computer Science and Media Technology
Gjøvik University College
P.O. Box 191
2802 Gjøvik
Knut Borg
2013/06/02
Real time detection and analysis of PDF-files
Abstract
The PDF-file format is a popular vehicle for attacks because the format is quite versatile. A PDF-file can be used in direct attacks against specific targets such as the government, the army or other high-value targets. These kinds of attacks may be performed by foreign intelligence services or by organised crime, as they have the most to gain from a successful attack. The attacks are often well obfuscated, which makes it easy for users to unintentionally execute the malware on their machines. A PDF-file may for instance contain a well-written report with information that is important to the user [1], but also contain malicious code intended to perform reconnaissance on the target's network.
This master thesis is a continuation of the results of Jarle Kittilsen's master thesis from 2011. The thesis will utilise Kittilsen's proposed methodology of using the machine learning tool 'support vector machine' to classify PDF-files as malicious or benign. This thesis focuses on online detection of PDF-files, whereas Kittilsen performed post-detection. One of the biggest problems with online detection of PDF-files is the time frame from when a PDF-file is detected until it has been classified as either malicious or benign. This master thesis seeks to provide answers about the viability of an online detection system for PDF-files.
Sammendrag (Norwegian abstract)
A PDF-file can be used as a direct attack against specific targets such as the government, the military or other high-value targets. Such attacks may be carried out by organised criminals or foreign intelligence services, because these groups stand to gain the most from a successful attack. The attacks are often well hidden, so the probability that users unknowingly run malicious code on their PCs is high. A PDF-file may for instance contain a well-written report with important information relevant to the user [1], but the PDF-file may also contain code that enables the attacker to perform reconnaissance on the network the user resides on.
This master thesis is a continuation based on the results of Jarle Kittilsen's master thesis from 2011 [2]. The thesis will use Kittilsen's proposed method of applying the machine learning tool 'support vector machine' to classify PDF-files as benign or malicious. The thesis will focus on the possibility of an online detection system for PDF-files, since Kittilsen focused on detection of PDF-files after the files had reached their recipients. One of the biggest problems for an online detection system is the time taken from when a PDF-file is detected until it has been classified as benign or malicious. This master thesis seeks to answer whether an online detection system for PDF-files is a realistic possibility.
Acknowledgements
I would like to thank my supervisor, Prof. Katrin Franke, for providing the master thesis topic. Franke provided guidance and assistance throughout the project, as well as insight and constructive criticism. I would also like to thank Jayson Mackie for technical support, ideas and tips in regards to the master thesis report.
A big thanks to my classmates at GUC, including both the graduating students and the first-year master students, for ideas and feedback regarding the master thesis.
Finally, I would like to thank my family for motivation and support throughout my studies at Gjøvik University College.
Contents
Abstract
Sammendrag
Acknowledgements
Contents
List of Figures
List of Tables
Glossary
1 Introduction
   1.1 Problem Description
   1.2 Keywords
   1.3 Justification, Motivation and Benefits
   1.4 Research Questions
   1.5 Exclusion of JavaScript
   1.6 Contributions
   1.7 Thesis Outline
2 Related Work
   2.1 PDF Analysis
   2.2 Use of Machine Learning Tools
3 Choice of Methods
   3.1 Online Analysis
   3.2 Lower Level Programming Language
   3.3 Portable Document Format
   3.4 Snort
   3.5 Extraction From Memory Locations
   3.6 Hard drive, SSD and Ramdisk
   3.7 Extracting Features
      3.7.1 Finding Features
      3.7.2 Naive String Search
      3.7.3 Improved String Search
      3.7.4 Multi-threading
   3.8 Support Vector Machine
4 Implementation
   4.1 Snort and Extraction of PDF-file
   4.2 Hard drive, SSD and RamDisk
   4.3 Feature Extraction
      4.3.1 Finding Features
Glossary
• BMH - Boyer-Moore-Horspool
• Naive - In this thesis the term is used about a brute-force methodology for solving a problem.
• I/O - Input/Output (information to/from keyboard, hard drives etc.)
• Ramdisk - A chunk of RAM which can be used as a normal storage medium.
• Online - Used about a system that works close to real time, but may have some delay.
• SSD - Solid State Drive
• Substring - Used when talking about a specific word/text being searched for in a larger string of text/data.
• SVM - Support Vector Machine
1 Introduction
This chapter serves as an introduction to the author's master thesis topic. It gives a brief explanation of the topic itself, its challenges and justifications, as well as what kind of new knowledge the master thesis seeks to provide.
1.2 Keywords
Online detection system, PDF-file analysis, pattern matching algorithms, feature extraction,
support vector machines, computer network security.
By improving the analysis methodology one could prevent these malicious PDF-files from reaching the user, preventing them from unintentionally running malicious code. Not only will this prevent potential damage to computer systems and databases, it can also prevent the attacker from gathering sensitive or non-sensitive information from the business or organisation [1].
1.6 Contributions
This master thesis seeks to provide more knowledge about online analysis of PDF-files. Every element, from Snort detecting and logging a PDF-file to the PDF-file being classified as benign or malicious, will be explained. Achievements and problems will be explained, as well as possible countermeasures to the problems discovered. The focus of this thesis is the PDF-file format, because the format has seen a high increase in exploits in recent years [10]. While research has been done on analysing PDF-files for malicious content, to the author's knowledge no significant research has been done on online analysis of PDF-files captured directly from network traffic.
• Chapter 2 provides an overview of related research that has previously been done. The chapter starts by giving an overview of general research on the PDF-format. Machine learning tools are covered in the last section of the chapter.
• Chapter 3 explains why the author chose a certain method to solve a given problem.
• Chapter 4 explains how the author implemented the different parts of the detection system.
• Chapter 5 shows the results the author obtained from the different experiments.
• Chapter 6 provides a summary of the project and the conclusions the author has drawn.
2 Related Work
This chapter seeks to provide information about research relevant to the author's master thesis. The author believes it has been made quite clear that Jarle Kittilsen [2] already covered a great deal of related work, and the author feels it is unnecessary to repeat Kittilsen's contributions. The author therefore wishes to focus on other researchers' contributions.
Stevens also presents a list of PDF-features and explains what they do and whether they are a sign of malicious content. For instance, the features '/AA' and '/OpenAction', which indicate that an automatic action is to be performed, are very suspicious if the same PDF-file also contains JavaScript. JavaScript in a PDF-file is represented by either '/JS' or '/JavaScript'.
According to Stevens, these are the most important features when detecting malicious PDF-files [12]:
• /Page - Number of pages inside a PDF-file. Most malicious PDF-files have only one page.
• /JS, /JavaScript - Indicate utilisation of JavaScript.
• /RichMedia - Indicates utilisation of Flash.
• /AA, /OpenAction - Indicate an automatic action is to be performed. Often used to execute JavaScript.
• /AcroForm - Number of embedded forms.
• /JBIG2Decode, /Colors (with a value larger than 2^24) - Indicate utilisation of vulnerable filters.
A researcher named Paul Baccas has published his findings from analysing malicious PDFs [13]. For JavaScript he found that out of 64,616 PDF-files containing JavaScript, only 1,093 were benign. 98% of the PDF-files containing JavaScript were malicious, which means that JavaScript is a good indication that a PDF-file may be malicious. Baccas continued by looking for mismatches between 'obj' and 'endobj' and between 'stream' and 'endstream'. Out of 10,321 PDF-files containing mismatched objects, 8,685 were malicious. This result shows that only 16% of the PDF-files containing mismatched objects were benign, which means that the occurrence of mismatched objects in a PDF-file may serve as an indicator of the file being malicious. For mismatches between 'stream' and 'endstream' he found that 1,585 out of 2,296 PDF-files were malicious, a malicious rate of 69%. Roughly 7 out of 10 such PDF-files were malicious, so a mismatch between 'stream' and 'endstream' can also serve as an indicator of malicious presence.
There are also several tools available, mostly developed in their spare time by people working in the IT security industry. Here is a brief overview of some of the available tools:
• PeePDF is a tool written in Python by Jose Miguel Esparza. It provides capabilities for analysing PDF-files similar to Stevens' PDFiD, but PeePDF can also create new PDFs or edit existing ones [14]. The purpose of the tool was to create a complete tool set instead of having to rely on three or four separate tools.
• PDFxray [15] is an analysis tool where one can upload malicious PDF-files to PDFxray's website; however, the site is down for maintenance at the time of writing. As with the Jsunpack tool, whose source code is hosted at Google [16], PDFxray can be compiled for private use from Github [17].
• PDF Scrutinizer [18] is a tool which uses static and dynamic detection mechanisms, i.e. statistical analysis and execution of malicious code. The tool also attempts to emulate a PDF reader's behaviour, with success according to the authors.
The paper named 'Static Detection of Malicious JavaScript-Bearing PDF Documents' [4] explains its authors' approach to using SVM to analyse malicious JavaScript in PDF-files. The authors used a method called "One-Class Support Vector Machine" (OCSVM), which they claimed to be a very good option when trying to classify JavaScript as malicious or not. The idea is that the SVM only needs examples from one class in order to build a classification model, thereby improving the classification performance. Figure 2 shows how the learning and classification are performed [4]. During the learning process, all samples of benign PDF-files are mapped into a high-dimensional space and enclosed by a hypersphere. The OCSVM then tries to find the centre "c" and the radius "R" of this hypersphere. When JavaScript is being classified, the OCSVM checks whether the new data point's distance from "c" is greater or smaller than "R". If the distance is smaller than "R", the data point will be treated as benign. However, if the distance is greater than "R", the data point will be treated as malicious.
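Written out as a formula (a standard way of expressing the hypersphere rule described above; the notation is not taken verbatim from [4]), a new data point x is classified as

\[
f(x) =
\begin{cases}
\text{benign} & \text{if } \lVert \phi(x) - c \rVert \le R \\
\text{malicious} & \text{if } \lVert \phi(x) - c \rVert > R
\end{cases}
\]

where φ maps the data point into the high-dimensional space, and c and R are the centre and radius of the hypersphere learned from the benign samples.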
The paper 'Obfuscated Malicious Javascript Detection using Classification Techniques' [5] proposed methods to detect obfuscation in JavaScript, since obfuscation is often a sign of attackers trying to hide their malicious code. In their experiments the authors used Naive Bayes, ADTree, SVM and RIPPER. They found that the machine learning classifiers could produce highly accurate results. Figure 3 shows a table of results from their first experiment with classifying JavaScript. The first column shows precision, which is the ratio of (malicious scripts labeled correctly)/(all scripts labeled as malicious). The second shows recall, which is the ratio of (malicious scripts labeled correctly)/(all malicious scripts). The third column contains an "F2"-score, which combines precision and recall but values recall twice as much as precision; an "F1"-score treats both equally and an "F0.5"-score values precision twice as much as recall [19]. The last column is Negative Predictive Power (NPP), which is the ratio of (benign scripts labeled correctly)/(all benign scripts). As can be seen in figure 3, SVM has the best precision rate at 0.920, which is quite high. The authors were quite happy with the results of applying machine learning tools to analyse JavaScript. The author of this document has to emphasise, however, that the paper in question analysed JavaScript found in the wild and not necessarily attached to PDF-files.
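For reference, the metrics described above can be written out using the standard definitions, with TP, FP and FN denoting true positives, false positives and false negatives:

\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
\]

Here β = 2 gives the F2-score (recall weighted twice as much as precision), β = 1 the F1-score, and β = 0.5 the F0.5-score.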
3 Choice of Methods
This chapter provides information about why the author chose to use a specific methodology to
solve a specific problem.
3.4 Snort
Snort is an Intrusion Detection System (IDS) developed and maintained by Sourcefire. It was released by Martin Roesch in 1998 and has grown in popularity to approximately 400,000 registered users and over 4 million downloads [25]. Snort's abilities range from protocol analysis to content matching for detecting buffer overflow attacks and stealth port scans. Snort also has the capability to perform real-time alerting, outputting the alarms to a user-specified file, a WinPopup message for Windows clients or Unix sockets for UNIX/Linux distributions. Snort can also log the network session that triggered an alarm for analysis purposes.
Jarle Kittilsen used Snort in his detection system [2], and the author wanted to find out whether Snort has the functionality needed to be part of an online detection system for PDF-files.
Table 1 shows specifications and features for the different storage devices. The speed, time and storage capacity information has been gathered from several different sources such as PCWorld, HP and others who have performed various benchmark tests [26][27][28]. Note that the author has not taken the possibility of a RAID setup into consideration when gathering the hard drive information in table 1.
As can be seen in table 1, the ramdisk has the potential for incredibly high speed, but the downside is that the ramdisk is a volatile storage medium. If the power shuts down, the information stored on the ramdisk will be gone. The size of the ramdisk is also limited to the amount of available RAM the computer has. The SSD is not a volatile storage device, has no issues with fragmentation and is a lot faster than a regular hard drive. The SSD is, however, limited in the number of times it can write to the same space, and it can be quite expensive relative to the amount of storage it provides compared to a regular hard drive. The regular hard drive is slower than its counterparts (SSD and ramdisk), but it is a lot cheaper per gigabyte and has the highest available storage capacity, ranging up to four terabytes. Section 5.1 contains tests and results for the different storage devices.
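On a Linux system, a ramdisk of the kind compared above can typically be created with tmpfs; the mount point and size in this example are illustrative only:
mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk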
3.7 Extracting Features

There isn't enough time to develop a PDF-parser with security in mind, so the author decided that features will be extracted by performing normal string matching of feature name patterns in the PDF-file. There are different ways to perform a string search, and many papers have been written on achieving the optimal algorithm for single- and multi-pattern matching. Multi-string pattern matching is often used to check for plagiarism, as with the Rabin-Karp algorithm [30][31], while single-string pattern matching is used to check the frequency of an occurring string.
There are different ways to measure the performance of a search algorithm. One way is to measure the time difference between algorithms; another is to compare the number of if()-statements executed. An example of an if()-statement is:
if(string_block[i]==substring_block[i]). The latter makes it easier to spot how the algorithm performs for specific PDF-features in different PDF-files. By counting how many if()-statements are executed for each feature, one can see how the algorithm scales to different-sized PDF-files. The bottom line is to execute as few if()-statements as possible and thereby decrease the number of operations the CPU has to handle. This is important, since one of the main goals of this master thesis is to decrease the overall time spent, and since the feature extraction process itself was the most time-consuming process Kittilsen had to handle [2].
The string search process is potentially an easy task and one could have picked different algorithms from C libraries, but due to the nature of the PDF-file it is better to implement the chosen search algorithm from scratch. This gives the author more control over where the algorithm is in the PDF-file at any given time and makes it easier to double-check whether a string match found is actually a valid feature. An example of this is the feature '/Page'. One '/Page' is one "physical" page in the PDF-file, whereas '/Pages' is a root node containing several '/Page' entries [8]. More information about finding features, '/Page' and '/Pages' can be found in section 4.3.1.
These are the features Kittilsen proposed to use in order to achieve optimal results:
• /Page: Indicator of one page in the document. Malicious documents tend to contain only one page.
• /AcroForm: Number of embedded forms.
• /OpenAction and /AA: Malicious files tend to have automatic actions performed without user interaction.
• /RichMedia: Embedding of Flash-based content.
• /Launch: Number of launch actions.
• /JavaScript and /JS: Indicators of JavaScript.
• startxref: Presence of a startxref statement.
• trailer: Presence of a trailer.
• Mismatch between obj and endobj: Malicious documents may have a mismatch between obj and endobj.
The author notes that when talking about matching a feature in a PDF-file, the word "substring" is used about the feature name and "string" about the entire PDF-file.
the potential use of multi-threading. It is also impossible to control how many times a single feature has triggered an if()-statement, because that depends on which feature one has matched a byte against first. However, for single-core systems, this version of a naive string search could prove to be a lot faster than searching through the same PDF-file 12 times.
3.7.3 Improved String Search
This subsection provides information about algorithms that improve on the naive algorithm.
One method is the KMP algorithm developed by Donald E. Knuth, James H. Morris and Vaughan R. Pratt [34]. The algorithm is constructed so that each character in the text T is compared only once. KMP has a pre-computation complexity of O(m) [33]. The substring is compared to the start of the string and then shifted to the right until a substring match occurs.
Another method is the Boyer-Moore algorithm. Boyer-Moore changed how string searching was performed by comparing from the last character in the substring (right to left) while scanning the text from left to right. This allows the algorithm to skip characters and comparisons by jumping a number of positions to the right when no character match is found. The algorithm requires two tables of information: one table, called "jump", holds the jump length to use when a mismatch occurs; the second table, called "right", contains the rightmost index in the substring where character t appears [33]. Horspool improved the BM algorithm in 1980 by requiring only one table instead of the original two [32]. The table Boyer-Moore-Horspool (BMH) requires holds the jump length for each character. If the character compared does not exist in the substring (i.e. it is not in the "jump" table), the jump length is the size of the substring. If the character does exist in the substring (i.e. it is in the "jump" table), the jump length is a predetermined value depending on where the matched character occurs in the substring.
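As an illustration, the following is a minimal C sketch of the Horspool table construction and right-to-left search described above. It is not the thesis' actual implementation, and it omits the feature-validation checks discussed later; all names are illustrative.

#include <stdio.h>
#include <string.h>

#define ALPHABET 256

/* Build the BMH jump table: the default jump is the substring length;
   characters occurring in the substring (except its last position) get
   a jump that aligns their rightmost occurrence with the mismatch. */
static void bmh_table(const unsigned char *sub, size_t m, size_t jump[ALPHABET])
{
    for (size_t i = 0; i < ALPHABET; i++)
        jump[i] = m;
    for (size_t i = 0; i + 1 < m; i++)
        jump[sub[i]] = m - 1 - i;
}

/* Count occurrences of sub (length m) in str (length n). */
static size_t bmh_count(const unsigned char *str, size_t n,
                        const unsigned char *sub, size_t m)
{
    size_t jump[ALPHABET], count = 0, pos = 0;

    if (m == 0 || m > n)
        return 0;
    bmh_table(sub, m, jump);
    while (pos <= n - m) {
        size_t i = m - 1;
        while (str[pos + i] == sub[i]) {   /* compare right to left */
            if (i == 0) { count++; break; }
            i--;
        }
        pos += jump[str[pos + m - 1]];     /* jump based on the last byte */
    }
    return count;
}

int main(void)
{
    const char *text = "trailer ... startxref ... trailer";
    printf("%zu\n", bmh_count((const unsigned char *)text, strlen(text),
                              (const unsigned char *)"trailer", 7));
    return 0;
}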
The author chose to start by implementing BMH because of its effectiveness and easily comprehensible algorithm. Other research papers have proposed algorithms with an efficiency increase of about 10% in speed [35] compared to the BMH algorithm, but the author feels that the gain is not significant enough to warrant implementing a more complex algorithm at this stage. This also depends on the average size of the PDF-files that need to be classified. The results of the feature extraction can be found in section 5.2.
The reason why only the naive and the BMH algorithms were implemented in this master thesis is that the feature extraction process is not just about finding a match; one must also make sure that the match found is a valid feature. The author had to cross-check the results with other feature extraction tools, Didier Stevens' tool and Jarle Kittilsen's Python script [3][2], and the debugging process could be quite exhausting. The author had to figure out where each problem occurred, either an algorithmic problem or a "fault" in the PDF-file where a substring is mistaken for a valid feature, and it took quite a lot of time to properly ensure that the algorithm would only count valid features.
3.7.4 Multi-threading
Utilising multi-threading allows the feature extraction process to assign one thread to each feature to be scanned for. This only works if the computer has multiple CPU cores available, or if GPU threads are used for the computations instead. An application programming interface that supports multi-threading is OpenMP [36].
Figure 4 shows that multi-threading can be implemented in different ways [6]. One way is to create a for-loop that loops through all the features; OpenMP will then fork one subthread for each iteration of the for-loop, resulting in one thread per feature. An example can be seen in code-list 1:
// code-list 1
#pragma omp parallel /* (+ specific parameters based on your program) */
{
    #pragma omp for /* (+ specific parameters based on your program) */
    for (i = 1; i < 13; i++) // Each feature
    {
        f_c[i] = feature_check(f, P_C, i, size, f[i].type); // Number of matches.
    }
}
The parameters in the code example are the following: f_c is an array containing the feature counts, f is the struct where the features are stored, P_C is the PDF-file, i is the number under which a feature is stored (code-list 1 only), size is the size of the PDF-file, and type tells the algorithm whether the feature name contains '/' or not, since additional checking has to be performed on features without '/' at the beginning of the feature name, like 'obj' and 'startxref'. More information about the additional checks can be found in section 4.3.1. For feature extraction, if all of the features had a similar time cost, the total amount of time spent on feature extraction could be divided by 12, or by the number of features to be extracted. The time saved also depends on how many threads the CPU has available. A normal quad-core i7 CPU has 4 physical threads and 4 virtual threads, which is not enough when every feature should be searched for simultaneously. An option could be to utilise the GPU to perform this task. This way the CPU can focus on Snort and logging network packets, while the GPU takes care of the feature extraction process. Kristian Nordhaug wrote a master thesis on "GPU Accelerated NIDS Search" with Snort in mind, using CUDA technology [37]. By utilising the GPU instead of the CPU, one suddenly has access to a large number of threads. This means one could also use multi-threading inside the feature_check()-function. The PDF-file could be divided into several smaller chunks, each scanned for one feature individually. If a PDF-file is split into four chunks, one would need a maximum of 48 available threads in order to scan for every feature simultaneously. This is not a problem if the feature extraction process is delegated to the GPU.
While multi-threading could have provided some increased efficiency over single-threaded processing on the author's computer (CPU), the real efficiency increase would come from using multi-threading to search for all of the features simultaneously in a GPU implementation. However, the efficiency might decrease somewhat due to scheduling operations. Due to time constraints, multi-threading was not implemented and the author could not experiment with Nvidia's CUDA technology [38].
3.8 Support Vector Machine

The author chose to follow Jarle Kittilsen's recommendation of utilising a Support Vector Machine [2]. Some C-code implementations of SVM are:
• LibSVM [39]
• LibSVM-light [40]
• Shark [41]
Jarle Kittilsen used PyML, which is compatible with the input data structures of LibSVM and LibSVM-light [42]. The author chose to utilise LibSVM as a starting point. Depending on how LibSVM performed classification-wise, and how much time it consumed, the author would decide whether to stick with LibSVM or try other implementations.
Figure 5: A linear SVM. The circled data points are the support vectors - the samples that are closest to the
decision boundary. They determine the margin with which the two classes are separated [7].
Figure 5 shows an example of how an SVM handles a two-class learning problem. One of the classes is often denoted '1' (positive) and the other class '-1' (negative). The dots in the figure show the different data points, where the red circles belong to one class and the blue crosses belong to the other. It is important to find the optimal decision boundary in order to get the best classification rate. Kittilsen chose to use the Gaussian kernel, which has two important values one can tweak: the "inverse-width" parameter γ and the "penalty value" C [2]. Kittilsen's optimal values are γ = 0.1 and C = 100. The author decided to use Kittilsen's proposed optimal classifier so that more time could be allocated to developing a system suitable for online implementation.
Figure 6 shows an example of how different values of gamma affect the decision boundary of the Gaussian kernel.
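For reference, the Gaussian (RBF) kernel mentioned above is commonly written as

\[
K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^2\right)
\]

where a larger γ makes the kernel narrower and lets the decision boundary follow the training samples more tightly.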
4 Implementation
This chapter explains the development process, from capturing network traffic to the classification of the PDF-file.

4.1 Snort and Extraction of PDF-file
• pdf tcp any any <> any any (msg:"PDF detected"; content:"%PDF-";
fast_pattern;tag:session,0,packets,60,seconds; sid:2000011;)
• pdf tcp any any <> any any (msg:"EOF detected"; content:"%%EOF";
fast_pattern;sid:2000012;)
Rule id 2000011, parts of which were gathered from Kittilsen's detection rule [2], detects PDF-files by looking for the PDF-header in both outgoing and incoming network traffic. Rule id 2000012 detects the end of a PDF-file. This alarm is used to alert the author's software to start the extraction of the PDF-file from Tcpdump's log-file. The two PDF-detection rules can be placed in a new rule file called 'pdf.rules' and registered in the Snort config-file as: include $RULE_PATH/pdf.rules
An important note is that the first rule states that Snort will log the following packets in the specific network session for 60 seconds. This may cause problems with regard to network bandwidth and is discussed in section 6.1.
The following declares a new rule type based on the form of alert:
ruletype pdf
{
    type alert
    output alert_fast: pdf.alert
    output alert_unixsock
}
The new rule type pdf uses the alert warning type; the log of triggered alarms contains only basic information and is output to a file called 'pdf.alert'. An alert will also be sent over a Unix socket named 'snort_alert'. A Unix socket allows two or more programs to communicate with each other, either by establishing a connection between sender and receiver using 'SOCK_STREAM' (TCP-like) or by just pushing packets with 'SOCK_DGRAM' (UDP-like) [44]. The author chose the 'DGRAM' option because that is what Snort uses to push out alerts. To start Snort, the following has to be entered:
snort -c snort.conf
One can add additional parameters to the start command, like forcing Snort to use 'alert_fast' and to log with Tcpdump; however, the author has already enabled these things in Snort's config-file. As mentioned in section 3.4, the author forces Snort to log with Tcpdump in the pcap format, unlike Kittilsen, who used Snort's own logging format called unified2 [2]. This allows one to skip the process of converting the unified2 log file to the pcap format before extracting the PDF-file.
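The listing below assumes a Unix datagram socket that has already been created and bound. A minimal sketch of that setup is shown here; the socket path is an assumption (Snort's alert_unixsock output writes to a socket file named 'snort_alert' in its log directory), and the helper name is illustrative.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Create and bind the Unix datagram socket Snort sends alerts to. */
static int open_snort_socket(const char *path)
{
    struct sockaddr_un addr;
    int sock = socket(AF_UNIX, SOCK_DGRAM, 0);

    if (sock < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(addr.sun_path);                      /* remove a stale socket file */
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}

/* Usage: int sock = open_snort_socket("/var/log/snort/snort_alert"); */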
//-----Source code for Unix_socket alarm-----
// Waiting for the PDF alarm to trigger
// [0] is the first block in the snort alarm message output.
do
{
    printf("Waiting for alert.\n");
    recv = recvfrom(sock, (void *)&alert, sizeof(alert), 0, (struct sockaddr *)&temp, &len);
    printf("[%s][%d]\n", alert.alertmsg, alert.event.event_id);
} while (alert.alertmsg[0] == 'E');
// Waiting for the entire TCP session to be captured (%%EOF hit)
recv = recvfrom(sock, (void *)&alert, sizeof(alert), 0, (struct sockaddr *)&temp, &len);
The listed source code shows how the author's program loops until a PDF-header (%PDF-) is detected. When the PDF-footer ('%%EOF') is detected, the author's program tells Tcpflow to start extracting the PDF-file from Tcpdump's log-file. The name of the log-file is "*.log.*", meaning either "snort.log.1234" or "tcpdump.log.1234". The author's program starts Tcpflow with "tcpflow -r [name of log-file]", and Tcpflow then outputs a file of the form "*.*.*.*.*-*.*.*.*.*". The file is named "[IP-address].[port-number]-[IP-address].[port-number]". The author's program finds the extracted file by using the 'popen()'-function with the input "find *.*.*.*.*-*.*.*.*.*". Tcpflow's output file contains the HTTP-header information, as can be seen in figure 7; the rest of the PDF-file is, however, intact. The author discovered that even though the HTTP information was in front of the PDF-header, the file was identified as a PDF-file and could be opened by a PDF-reader. According to Symantec [45], this is a potential attack methodology, and the author decided it could be interesting to see if any of the PDF-files in Kittilsen's benign dataset contained suspicious content in front of the PDF-header. More information about this can be found in section 5.4.
A small side effect of the PDF-file specification not being picky about the PDF-header can be seen in figure 8. The file shown is the file containing rules for Snort, but it is identified as a PDF-file because the file begins with a Snort rule looking like this:
pdf tcp any any <> any any (msg:"PDF detected"; content:"%PDF-";
fast_pattern;tag:session,0,packets,60,seconds; sid:2000011;)
i.e. it is only missing '%' in front of 'pdf'.
After the features have been extracted and the PDF-file has been classified, Tcpflow's output file is removed; the Tcpdump log-file, however, remains. The Tcpdump log-file proves to be a problem as more PDF-files are logged by Snort. Every time a new PDF-file is detected by Snort, the content of the Tcpdump log-file has to be extracted. After n PDF-files have been logged, n PDF-files will have to be extracted from the Tcpdump log-file at the same time, and the process becomes a resource hog as new PDF-files are detected. The author tried several methods to counter this problem:
• Deleting the Tcpdump log-file. The problem is that Snort/Tcpdump won't recreate the file; the file is only created when Snort starts up.
• Deleting the file and creating a new one with the same name. The problem is that Snort/Tcpdump won't log to the new file.
• Using the 'write()'-function to recreate the file. The good thing is that the file is empty, but the problem is that it will no longer receive logged packets.
• Using the 'write()'-function to delete the file's content in order to start over again. The good thing is that the file is empty. The problem is that when the author logged the same PDF-file again, the Tcpdump log-file would suddenly be twice the size. The author believes that Snort/Tcpdump keeps a reference point into the Tcpdump log-file, and it is therefore very difficult to manipulate where in the log-file Snort/Tcpdump should start dumping network packets.
The result of the Tcpdump log-file problem can be summarised as follows:
• The extraction of PDF-files becomes heavier (in time/space) as additional files are logged.
• Finding the correct PDF-file may prove to become more difficult as new PDF-files are logged, seeing as each PDF-file will be named with IP-addresses and ports. Duplicates may occur.
The easiest way to counter the Tcpdump log-file problem, and maybe the only way, is to force Snort to quit after a PDF-file has been logged. The used Tcpdump log-file is then removed and Snort can start running again. A new Tcpdump log-file will be created and the system is ready to receive a new PDF-file. However, one can't implement a system where the IDS itself (Snort) has to be shut down every time a PDF-file is logged to disk, and the author will therefore not implement the system, in its current form, in an online environment. Another problem caused by Snort and the Tcpdump log-file is that Snort can't receive two or more PDF-files at the same time, for the following reasons:
• If a large PDF-file is being logged by Snort and a smaller PDF-file is also logged during this time, the small PDF-file finishes first and the author's program has no way of distinguishing between the two PDF-files. Neither does the author's program know which PDF-file has finished being logged to disk.
• A benign PDF-file may have two or more PDF-footers ('%%EOF'), which causes problems because the author's program uses '%%EOF' to indicate that the PDF-file has been completely written to disk. An example can be seen in figure 9.
With regard to checking when the entire PDF-file has been logged to disk, Snort does not have the ability to verify whether the PDF-file has finished transferring. There is a possibility of checking the file size in the HTTP header being sent (if available), but Snort doesn't give access to counting packets or measuring the size of the packets. Checking the HTTP-header is therefore an unreliable method and cannot be utilised. The only way Snort is allowed to split the Tcpdump log-file is when a given size limit has been set. Since PDF-files can vary in size from 100KB to several megabytes, there is no point in using this functionality.
Figure 9: Left side: Kittilsen's original report. Right side: Kittilsen's report with content marked by the author. Marking content in the PDF-file resulted in a new PDF-file and a '%%EOF' right under the PDF-header.
/Feature Name
An example of a feature name starting with '/' is '/Page'. Looking for '/Page' in Kittilsen's report gives a feature count of 162, even though there are only 132 pages in the PDF-file. The problem is that there is also a feature name called '/Pages', which describes the structure of the "page tree". Figure 10 shows an example of this, gathered from Adobe's specification of the PDF format [8].
Figure 10: '/Pages' is the root node. Everything below the blue line are objects, and each object contains one '/Page'. The picture shows a PDF-file with 62 pages [8].
In order to prevent '/Pages' from being counted along with '/Page', one has to check the next block after the last character in the feature name; in this case that would be the fifth block after '/'. If the block contains the ASCII value of either end-of-line, a space or '/', we know that the match is most likely a valid feature. The author uses the phrase "most likely" because one can't be 100% sure whether the match is valid or not. This will be explained in section 6.1, and more examples will be presented in section 4.3.1.
Figure 11: /AA with what Gedit presents as hex values in the blocks in front of and after the feature name.
The author found another PDF-file [47] and tried to extract features from it. A cross-check with Didier Stevens' tool [3] showed that the feature '/AA' was missing. As can be seen in figure 11, the feature name had a "hex value" in the block after the feature name. The author therefore improved the validation rule: as long as the block after the feature name does not contain an ASCII value of a-z or A-Z, the chance of it being a valid feature is high.
In order to check feature names with '/', one has to check at least the following bytes:
[/][F][e][a][t][u][r][e][?]
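A minimal C sketch of this validation rule (a hypothetical helper, not the author's exact code): a match on a '/'-prefixed feature name is accepted only when the byte following the name is not a letter.

#include <stddef.h>

/* Returns 1 if the byte after a '/'-feature name passes the rule above. */
static int valid_slash_feature(const unsigned char *pdf, size_t match_pos,
                               size_t name_len)
{
    unsigned char next = pdf[match_pos + name_len]; /* the "[?]" block */

    if ((next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z'))
        return 0;   /* e.g. '/Pages' must not be counted as '/Page' */
    return 1;
}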
Normal Feature Name
These features do not start with '/', which makes it slightly harder to perform match validation. Examples are the two features named 'obj' and 'endobj'. Kittilsen used a mismatch between the two as an indicator of whether the PDF-file is malicious or not. This is because 'obj' marks the beginning of an object while 'endobj' marks the end of an object, and a mismatch between the two could therefore mean malicious content. While still using Kittilsen's report as a point of reference for feature extraction, a mismatch between 'obj' and 'endobj' occurred, as can be seen in figure 12. This was because the author started out checking feature names by only counting features whose preceding block had the ASCII value of end-of-line or space. Having '>' next to 'endobj' is most likely the result of a bug in LaTeX and the creation of the PDF-file, since the PDF-file states that it was created by LaTeX. However, it does not change the fact that the author's algorithm missed an important feature.
Out of 2719 'obj' and 'endobj' occurrences, only one feature name had the ASCII value '>' in front of the name. It is nonetheless important to take such possibilities into account. In order to check whether it could be a potential bug in Snort or Tcpdump, the md5 hash-sum of the PDF-file downloaded through a normal web browser was compared with that of the PDF-file extracted with Tcpflow. The HTTP-header was removed and a hashing of the document was performed; the documents were identical. At first the author thought of skipping the check of the block in front of the feature name, but that led to a new problem, as can be seen in figure 13. While the block after 'obj' contains a '(', one can clearly see that the substring match is not a valid feature, but just a part of Kittilsen's keyword for a citation in the PDF-document. The example in figure 13 also shows that a false positive would be counted if Kittilsen had used a citation keyword named '/Page', because the ASCII value of '.' is not accounted for. The validation check also needs to take the digits 0 to 9 into account. As can be seen in figures 12 and 13, the feature 'obj' is often quite close to a '0'. Since '>' can appear next to 'endobj', one could argue that at some point there will be a digit right next to a feature name. In such a case a digit would indicate a valid match, but one can't be completely sure.
In order to check feature names without '/', one has to check the following bytes:
[?][F][e][a][t][u][r][e][?]
Figure 13: Kittilsen’s keyword used for citation in his master thesis was found.
The next pseudo-code sample, in code-list 4, shows how the naive algorithm works. As can be seen from the pseudo-code, the naive algorithm is quite simplistic. This actually has an effect on performance, as will be discussed in section 5.2.
// Code-list 4
// Pseudo code for how the feature extraction process is performed

for (bytes in feature length)
    if (feature_name[] != PDF[])
        return 0;

if (feature doesn't have '/')
    if (block in front has a-z, A-Z, 0-9, '.')
        return 0;

if (block after has a-z, A-Z, 0-9)
    return 0;
return 1;
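A runnable C version of this pseudo-code might look as follows; a sketch under the stated rules, with illustrative names and with bounds checks added.

#include <stddef.h>

static int is_alnum_byte(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Returns 1 if sub (length m) matches pdf (length n) at pos and the
   surrounding blocks pass the validation rules, 0 otherwise. */
static int naive_match(const unsigned char *pdf, size_t n, size_t pos,
                       const unsigned char *sub, size_t m)
{
    /* byte-for-byte comparison of the feature name */
    if (pos + m >= n)
        return 0;
    for (size_t i = 0; i < m; i++)
        if (pdf[pos + i] != sub[i])
            return 0;

    /* features without '/' also need the block in front checked */
    if (sub[0] != '/') {
        unsigned char before = (pos > 0) ? pdf[pos - 1] : ' ';
        if (is_alnum_byte(before) || before == '.')
            return 0;
    }

    /* the block after the name must not be a letter or digit */
    if (is_alnum_byte(pdf[pos + m]))
        return 0;
    return 1;
}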
// Code-list 5
// Pseudo code for how the feature extraction process is initiated
for (each feature)
    start_time_feature_x
    feature_check(return frequency of one feature in the PDF-file)
    end_time_feature_x
The second pseudo-code example, in code-list 6, shows roughly how the BMH algorithm has been implemented. If one compares the naive algorithm in section 4.3.2 with the BMH example, one quickly discovers that the implementation of BMH is far more complex than its naive counterpart. A problem that may, or may not, occur is that BMH has to account for a lot of different if()-statements and jumps back and forth in the code. The longer (and more often) the program has to jump between code, the less efficient it gets. However, one does not know whether this will have an effect for a PDF-file of a given size, nor at which size the naive algorithm will be equally efficient as the BMH algorithm. The efficiency question with regard to size applies both to the size of the PDF-file and to the length of the feature being searched for.
// Code-list 6
// Pseudo code for how the feature extraction process is performed
for (every byte in PDF-file)
    if (last byte does not match)
        for (all unique characters in substring)
            if (match)
                jump x bytes according to table
                exit for()
        if (no match)
            jump size of substring
    else (last byte does match)
        for (every byte in substring, backwards)
            if (match)
                move backwards one byte
            else (no match)
                for (unique characters in substring)
                    if (match)
                        jump x bytes according to table
                        exit for()
                if (no unique match)
                    jump size of substring
        if (verified match)
            check byte in front and after substring match
            if (all good)
                total count of feature +1
            else (not the substring I was looking for)
                jump size of feature length
return the total count
As one can see from the pseudo-code of the BMH algorithm, the author predicts that the last byte in the substring does not match a given byte in the PDF-string. This is an optimisation question, where one wants to place the code most likely to be executed closest to the if()-statement, while the code executed the least is placed at the bottom. The author notes that this isn't a normal string search through a normal book or a text database. The raw PDF-file contains bytes with a wider range of ASCII values than the normal alphabet plus the digits zero to nine. Therefore the likelihood of the last byte in the substring matching a byte in the PDF-string is smaller. If a match of the last ASCII value occurs, the algorithm will go to the second-to-last character in the substring and perform a comparison. It will continue that way unless a mismatch occurs. If no mismatch occurs, the algorithm will jump to the validation checks. If a valid feature is found, the algorithm will count the feature match and continue further into the PDF-file. However, if a mismatch occurs, the algorithm will check whether the character in the PDF-file matches a character in the substring. If a match is found, the algorithm will jump a predefined length. If the character does not exist in the substring, the algorithm will jump forward by the length of the substring.
Wget is a free utility for non-interactive download of files from the Web [48].
As can be seen in appendix D and noted in section 5.4, there were malicious PDF-files in the benign set of PDF-files Kittilsen built his SVM model on. The author does not have the time to manually check all of the "benign" PDF-files to make sure that they are actually benign, but it is important to note that the foundation on which one builds the SVM-model should be clearly separable, i.e. there should be no malicious files in the benign PDF-set. It is also important to note that while JavaScript may be a strong indication of a malicious PDF-file [13], it won't do any good if none of the malicious PDF-files contains any JavaScript. Add to that the fact that half of the benign dataset could contain JavaScript, which would cause the SVM model to become skewed.
The following features are used to build the SVM model:
• /Page
• /AcroForm
• /OpenAction and /AA
• /RichMedia
• /Launch
• /JavaScript and /JS
• startxref
• trailer
• The mismatch between 'obj' and 'endobj'
The SVM-model itself is built using Jarle Kittilsen's settings from his most successful experiment [2]. However, LibSVM is utilised for its C-code implementation instead of the Python counterpart called PyML [49]. The parameters for LibSVM are C-SVC (which is also the default option in LibSVM), the kernel type 'radial basis function' (Gaussian kernel), γ (gamma) = 0.1 and cost C = 100. The author utilised Weka [50] to perform the model experiments, and the results are shown in table 2 and table 3. Appendix F shows the Weka GUI after performing a 10-fold cross-validation and after using the entire dataset for classification. The SVM-model itself, which the author used for the classification of PDF-files, was built by LibSVM with no involvement from Weka.
Table 3: Weka and LibSVM result by using the entire training set
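With the standard LibSVM command-line tools these settings correspond to a training invocation along the lines of the following, where the file names are illustrative:
svm-train -s 0 -t 2 -g 0.1 -c 100 pdf_features.train pdf.model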
The class of a file is given by adding '1' or '-1' in front of the list of features in the input file. The author has defined benign files as '1' while malicious files are classed as '-1'. If one "guesses" the class correctly, LibSVM will output: Accuracy = 100% (1/1) (classification). This means that if the author assumes all captured files are benign, a malicious file will be reported as: Accuracy = 0% (0/1) (classification). This allows the author's program to distinguish between the feedback and log the captured files in the correct folder (benign/malicious) for post-analysis. LibSVM is best suited for post-classification because LibSVM needs to read and interpret the SVM-model for each classification. The best scenario would be to have LibSVM running continuously and send it a feature file for each new PDF-file as it is logged by Snort.
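That scenario could be realised with LibSVM's C API, which allows the model to be loaded once and reused for every prediction. The sketch below uses LibSVM's svm_load_model() and svm_predict() functions; the file name and feature values are illustrative.

#include <stdio.h>
#include "svm.h"   /* LibSVM's C interface */

int main(void)
{
    /* Load the SVM-model once at start-up instead of per classification. */
    struct svm_model *model = svm_load_model("pdf.model");
    if (model == NULL)
        return 1;

    /* One PDF-file's features in LibSVM's sparse format (index:value),
       terminated by index -1. The values here are made up. */
    struct svm_node x[] = {
        { 1, 132.0 },  /* e.g. /Page count */
        { 2,   0.0 },  /* e.g. /AcroForm   */
        { 6,   1.0 },  /* e.g. /JS         */
        { -1,  0.0 }
    };

    double label = svm_predict(model, x);
    printf("%s\n", label > 0 ? "benign" : "malicious");

    svm_free_and_destroy_model(&model);
    return 0;
}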
Figure 14: Brief documentation of how the author’s detection system works
This chapter contains the results of the different tests performed, ranging from extracting the PDF-file and extracting features from a PDF-file to classification using SVM and scanning suspicious files found during the master thesis period.
The development and testing was mainly performed on this computer set-up:
• Ubuntu 12.10 (32-bit).
• VmWare workstation 9.0.0 build-812388.
• Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
• 2x4GB DDR3 RAM
• Seagate 750GB, Model number: ST9750420AS
A test computer provided by NISLab [51], without VmWare, had the following set-up:
• Ubuntu 12.10 (32-bit).
• Intel core: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
• 2x2GB DDR3 RAM
• Seagate 250GB, Model number ST3250310AS
This virtual computer was created in order to scan specific PDF-files with different anti-virus
software.
• MS Windows 7 Pro(32-bit).
• VmWare workstation 9.0.0 build-812388.
• Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
• 2x4GB DDR3 RAM
• Seagate 750GB, Model number: ST9750420AS
The testing of the feature extraction process was performed on a large dataset of PDF-files: the author has been using Jarle Kittilsen's PDF-corpus of 7,454 benign and 16,280 malicious PDF-files [2].
5.1.2 Results
Tables 4 and 5 show the time from when the first network packet containing the PDF-footer arrives to when Tcpflow has finished extracting the PDF-file. Table 4 shows the time spent for what the author thought was the physical hard drive, while table 5 shows the time spent for a ramdisk.
The different PDF-files are:
The time is quite similar for both the ramdisk and the "hard drive". The author first thought this was because the network packets were stored in the hard drive's disk cache. A disk cache is a part of a hard drive where data is stored temporarily for faster access [55]. This is quite useful for smaller pieces of data that are often used by the computer; however, the disk cache has no effect when streaming large video files and the like. Because the PDF-files are so small and the Tcpflow process' effectiveness is dictated by the CPU, the difference in speed between the disk buffer and the ramdisk could be negligible. The time measured in both tests ranged from 12 seconds up to 25 seconds at most. Overall the time spikes were the same for each file regardless of storage device.
The author didn't believe that the same results would be achieved for both a hard drive and a ramdisk. In order to make sure that the hard drive results were valid, a read-speed benchmark of the hard drive was performed, as can be seen in figure 15. The benchmark provided an interesting result: the maximum read speed was measured at 2.4GB/s! The speed is not consistent, as it varies between 54MB/s and 2.4GB/s. The author believes this is caused by VMware preparing the section of the hard drive the benchmark is about to measure, so that this section gets placed in RAM. This means that the benchmark doesn't measure the speed of the physical hard drive, but rather of the section of the hard drive placed in RAM, which has a much higher read speed. Another hard drive benchmark was performed on the computer provided by NISLab, and the result can be seen in figure 16. Figure 16 shows a result one would expect from a physical drive, with a maximum read speed of 96.5MB/s and an average access time of 15.1ms.
The author decided to perform a new test on the computer NISLab had provided, in order
to get correct measurements of a physical drive as well as of SSD performance, as the author's
laptop only had one ordinary hard drive at its disposal. However, problems occurred when the
author tried to implement parts of the PDF-detection system on NISLab's lab computer. The
problem is related to how Snort sends alerts to the Unix socket. The author used the same
detection rules as were developed in the virtual machine. However, while Snort managed to detect
both the PDF-header and the PDF-footer in the virtual machine, on the lab computer Snort was
only able to detect the PDF-footer. The author made sure that Snort was compiled from the same
source code that was used for the virtual machine and that the additional tools like DAQ [56],
libpcap etc. had the same version numbers. Yet Snort refused to detect the PDF-header. The author
tried to tweak the detection rules, which led to Snort being able to detect the PDF-header. However,
with those rules Snort would not log the network session that contained the PDF-file. One could
either get Snort to detect the PDF-header and not log the network session, or keep the rule making
Snort log the network session and end up with Snort not "detecting" the PDF-header at all. The
author does not know exactly what causes this problem, but decided to set it aside in order to
focus on more important tasks. The author's supervisor can however attest that the PDF-detection
system worked on the virtual machine.
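For illustration only, detection rules of roughly the following shape can match the PDF-header and PDF-footer; this is a hedged sketch with placeholder SIDs and messages, not the exact rules developed for the thesis:

# Illustration only: placeholder SIDs/messages, not the thesis' exact rules
alert tcp any any -> any any (msg:"PDF header seen"; content:"%PDF-"; sid:1000001; rev:1;)
alert tcp any any -> any any (msg:"PDF footer seen"; content:"%%EOF"; sid:1000002; rev:1;)

The conflict described above appeared when header matching of this kind was combined with logging of the surrounding network session.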
The other reason why the author chose to count if()-comparisons by adding additional else-
statements, and not after the code executed by the if()-statements, is that the algorithm
often jumps back and forth, which is especially the case with the BMH algorithm. Instead of
counting only the if()-statement at line 12 in the pseudo-code sample, the author wanted to make
sure that all of the if()-statements were counted correctly.
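To make the counting approach concrete, the following is a minimal sketch (not the author's exact code) of a naive search in which an added else-branch counts a failed comparison at the moment it happens:

/* counts occurrences of feat in buf; *if_count accumulates every
   character comparison, whether it matched or not */
long naive_count(const unsigned char *buf, long n,
                 const char *feat, long m, long *if_count)
{
    long hits = 0;
    for (long i = 0; i + m <= n; i++) {
        long j = 0;
        while (j < m) {
            if (buf[i + j] == (unsigned char)feat[j]) {
                (*if_count)++;   /* comparison evaluated: match    */
                j++;
            } else {
                (*if_count)++;   /* comparison evaluated: mismatch */
                break;
            }
        }
        if (j == m)
            hits++;
    }
    return hits;
}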
The time was measured for each individual feature, and the mean was then calculated based
on how many iterations the feature was searched for. The author decided to let the algorithm
repeat the search for each individual feature ten times. The time is displayed in milliseconds and
is recorded with the 'gettimeofday()'-function. The reason for choosing the
'gettimeofday()'-function instead of the 'clock()'-function is that the author wants to
see how much time the feature extraction takes with normal processes running, and not the time
isolated to one process. In order to make sure no unnecessary processes were running during
the experiments, the author restarted Ubuntu before the experiments were executed. It is also
important to note that the author counted executed if()-statements and measured the time in
separate experiment runs.
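A minimal sketch of this timing setup, assuming the POSIX 'gettimeofday()' interface; the placement of the search call is illustrative:

#include <stdio.h>
#include <sys/time.h>

/* elapsed wall-clock time in milliseconds between two timestamps */
static long elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000L
         + (end.tv_usec - start.tv_usec) / 1000L;
}

int main(void)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);   /* wall-clock time, other processes included */
    /* ... one feature-search iteration would run here ... */
    gettimeofday(&end, NULL);
    printf("time: %ld ms\n", elapsed_ms(start, end));
    return 0;
}

Unlike 'clock()', which accumulates CPU time for the calling process only, 'gettimeofday()' measures elapsed real time and therefore reflects the load of the whole system.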
The first PDF-file on which the author performed feature extraction is Jarle Kittilsen's master
thesis report [2], and the results are shown in table 6. The graphs displaying the results
can be found in figure 17 and figure 18.
Kittilsen's report has a file size of 2.263.642 bytes.
A notable difference that can be spotted in table 6 is the difference between the two kinds of
features extracted with the naive algorithm.
Figure 17: Graph showing the difference in if()-statements executed by the naive and BMH algorithms for
the PDF-file jarle_kittilsen.pdf
Figure 18: Graph showing the time difference between the naive and BMH algorithms for the PDF-file
jarle_kittilsen.pdf
Features starting with '/' have an if()-statement execution count similar to the size of the
PDF-file in question, while the non-'/' features show a large increase in both the number of
if()-statements executed and the measured time. The difference is about a factor of three, which
is significant compared to the Boyer-Moore-Horspool (BMH) algorithm, whose results are quite
similar across the board. The author believes the reason for such a significant difference is that
'/' is quite rare in the PDF-file, while characters like 'o' and 't' occur more frequently. The
results within the same kind of feature are, however, quite similar, meaning '/Page', '/AcroForm'
and '/AA' have roughly the same number of if()-statements executed. The length of the feature
name in general does not seem to cause any significant difference when performing a naive search.
The author believes this is because the naive algorithm matches one byte at a time from left to
right, whereas the BMH algorithm benefits from longer substrings to match. Although the BMH
algorithm has a higher variance in if()-statements executed, it executes significantly fewer
if()-statements than its naive counterpart. The BMH time for each feature is similar across the
board, while the naive algorithm experiences a spike of about a factor of three for 'obj',
'trailer' and similarly named features compared with features starting with '/'. The reason the
naive algorithm spends a total of 113 milliseconds while BMH only spends 83 milliseconds is that
not all of the features are listed in the table; the author notes that there are an additional
two non-'/' features which cause the naive algorithm to be considerably slower.
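For reference, a minimal sketch of the Horspool skip-table idea follows; it is an illustration, not the thesis' implementation (the pseudo-code in section 4.3.3 is authoritative). The skip table lets the search jump ahead by up to the pattern length on a mismatch, which is why long feature names such as '/JavaScript' benefit more than short ones such as 'obj':

/* counts occurrences of pat[0..m-1] in buf[0..n-1] */
long bmh_count(const unsigned char *buf, long n,
               const unsigned char *pat, long m)
{
    long skip[256];
    long hits = 0;
    if (m == 0 || m > n)
        return 0;
    for (int c = 0; c < 256; c++)       /* default shift: whole pattern   */
        skip[c] = m;
    for (long k = 0; k < m - 1; k++)    /* shift by last occurrence in pat */
        skip[pat[k]] = m - 1 - k;

    long i = 0;
    while (i + m <= n) {
        long j = m - 1;
        while (j >= 0 && buf[i + j] == pat[j])   /* compare right to left */
            j--;
        if (j < 0)
            hits++;
        i += skip[buf[i + m - 1]];      /* jump ahead; always >= 1 */
    }
    return hits;
}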
The next PDF-file is "PDF32000_2008.pdf" from Adobe's PDF specification page [54], and the
results are shown in table 7. The graphs showing the results can be found in figure 19
and figure 20. The file size of the PDF-file is 8.995.189 bytes.
As shown in table 7, the results look quite similar to those for the previous PDF-file in table 6.
The number of if()-statements executed for features starting with '/' versus those without differs
by about a factor of three when performing a naive search, and the total number of if()-statements
executed by the naive algorithm is about five times that of the BMH algorithm. This is roughly the
same result as can be seen in table 6. An interesting thing to point out is that some features are
more time-exhaustive when searched for with the BMH algorithm than with the naive algorithm. This
is the case for features like '/Page', '/AA' and '/AcroForm'. This is strange, because BMH clearly
executes fewer if()-statements than the naive algorithm. The author believes the reason the
naive algorithm performs better in some cases is that the BMH algorithm is quite complex.
At one point the BMH algorithm has a "tree" of six for-loops/if()-statements, as can be seen in
the pseudo-code in section 4.3.3. When an algorithm contains a lot of for-loops and if()-statements
(i.e. becomes more complex), like the BMH implementation, the program jumps a lot
between code lines. This is one of the reasons why one should always organise the if()-
statements like in the pseudo-code sample "Tips to organise if()-statements":
// Tips to organise if()-statements
if (argument match)
{ This content will be executed the most and is therefore placed here. }
else if (argument match)
{ This content is not executed that often and should therefore be placed here. }
else
{ This content is executed the least and should therefore be placed here. }
The author notes that the amount of code to be executed when an if()-statement matches can also be
taken into consideration when organising the order of the if()-statements.
Figure 19: Graph showing the difference in if()-statements executed by the naive and BMH algorithms for
the PDF-file PDF32000_2008.pdf
The complexity of the implemented algorithm explains why the BMH algorithm may be a more
time-consuming process than the naive algorithm. However, because some of the features do not
begin with '/', the BMH algorithm is still more efficient overall, as the total time difference
between the two algorithms is 99 milliseconds in table 7. This is backed up by the fact that in
both table 6 and table 7 the total number of if()-statements executed by the naive algorithm is
about 4,5 times that of BMH, yet the naive algorithm only spends about 1,2 times as much time as
the BMH algorithm.
Figure 20: Graph showing the time difference between the naive and BMH algorithms for the PDF-file
PDF32000_2008.pdf
The last PDF-file is Adobe's reference guide for the PDF-format, named "pdf_reference_1-
7.pdf", and contains 32.472.771 bytes [8]. The results can be found in table 8. The
graphs displaying the results can be found in figure 21 and figure 22.
The results in table 8 reflect the tendencies from the previous PDF-files:
• The total number of if()-statements executed by the naive algorithm is about 4,5 times that of
the BMH algorithm, and yet the naive algorithm is only 1,2 times slower than BMH.
• The number of if()-statements executed is stable throughout the naive feature search, while
BMH shows a greater variance, which is affected by the length of the feature name.
• The final result of the naive algorithm is affected by how many features start with '/' and how
many do not, while the BMH algorithm is only affected by the length of the feature name in general.
• For feature names starting with '/', the naive algorithm is currently more efficient than the
BMH algorithm.
Figure 21: Graph showing the difference in if()-statements executed by the naive and BMH algorithms for
the PDF-file pdf_reference_1-7.pdf
The author uses the term "efficiency" for how much time an algorithm spends from the
start until every feature has been counted successfully. One could argue that the naive algorithm
is more efficient than the BMH algorithm because the naive algorithm can process more if()-
statements per time unit than the BMH algorithm. This is backed up by the fact that, were it not
for the features without '/', the naive algorithm would potentially have been the better algorithm
of the two. The author notes that even though the naive algorithm has beaten the BMH
algorithm in some cases, the author would recommend trying other search algorithms like
KMP (see the sketch below) in order to see whether the naive algorithm is still the better option
due to its simplicity. Scientific papers on new and improved search algorithms usually target
scanning through normal
text and/or databases. The PDF-file is a different scenario because it introduces extra conditions
such as:
• Validation checks to validate a feature.
• '/' at the beginning of the feature name.
• Bytes with a larger range of values than the standard a-z, A-Z and 0-9 (i.e. a wider distribution
of ASCII values, which means common characters like 's' and 'e' occur less frequently than in a
text file of the same size as the PDF-file).
Figure 22: Graph showing the time difference between the naive and BMH algorithms for the PDF-file
pdf_reference_1-7.pdf
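Since KMP is suggested above as a candidate for comparison, the following is a minimal sketch of the Knuth-Morris-Pratt prefix table and search [34]; it is an illustration, not code from the thesis:

/* builds the failure table: fail[i] = length of the longest proper
   prefix of pat[0..i] that is also a suffix of it */
void kmp_table(const unsigned char *pat, long m, long *fail)
{
    fail[0] = 0;
    long k = 0;
    for (long i = 1; i < m; i++) {
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            k++;
        fail[i] = k;
    }
}

/* counts occurrences of pat in buf, never re-reading a text byte */
long kmp_count(const unsigned char *buf, long n,
               const unsigned char *pat, long m, const long *fail)
{
    long hits = 0, k = 0;
    for (long i = 0; i < n; i++) {
        while (k > 0 && buf[i] != pat[k])
            k = fail[k - 1];
        if (buf[i] == pat[k])
            k++;
        if (k == m) {                  /* full match ending at position i */
            hits++;
            k = fail[k - 1];
        }
    }
    return hits;
}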
The last table, table 9, shows the difference between the total time spent by the naive
algorithm, the BMH algorithm and Jarle Kittilsen's Python script. Kittilsen's result is gathered
from appendix C. The difference is quite significant, with Kittilsen's script using roughly
73 seconds to extract features from his own thesis report. There is, however, a good reason why
Kittilsen's Python script is so slow: Kittilsen parses the PDF-document and then counts the
feature matches, while the author only performs a plain string search. Another question that
arises is why a 2,3MB PDF-file takes significantly more time to process than a 171MB PDF-file,
as can be seen in table 9. The 171MB PDF-file was given to the author by a student at Gjøvik
University College and consists of scanned papers with notes and other material. Since the
PDF-file essentially only consists of scanned images, the performance difference makes sense:
Kittilsen's master thesis report has a more complex
structure than the PDF-file consisting of images only. The master thesis report contains citations,
a bibliography, table/figure references etc. It makes sense that a PDF-file with a complex document
structure is more time-exhaustive to parse than a file containing only images. The time
spent when only a plain string search is performed is roughly linear in the size of
the PDF-file. The author notes that the large variation with Kittilsen's Python script is not a
result of other CPU-intensive processes running in the background; the processing times have
also been consistent over ten runs.
Table 9: The time, in milliseconds, spent extracting features with the naive algorithm, the BMH algorithm
and Kittilsen's Python script
5.3.2 Results
This procedure has been tested on both the set of benign PDF-files and the malicious
ones. The time spent from the start of the classification process has been consistent for both sets.
As can be seen in table 10, the time for classifying a PDF-file is 6 milliseconds. The classification
time has been consistent for all of the 23.734 PDF-files, which is logical because the time spent
on classifying a PDF-file does not depend on the size of the PDF-file or on how fast the feature
extraction process is. The data LibSVM has to process is quite similar every time it performs a
classification; the variance is only about five to ten bytes, depending on how many digits are
needed to represent the different feature values. If one wanted to save additional time during the
classification process, LibSVM would have to pre-process the SVM-model and then run in a "loop",
accepting each PDF-feature file and performing the classification. However, a constant time of 6
milliseconds is not bad either, because there are other, more time-exhaustive processes that
need more attention, namely the feature extraction process.
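The "loop" idea could look roughly as follows with LibSVM's C interface [39]; this is a hedged sketch in which the model file name and the feature vector are placeholders:

#include <stdio.h>
#include "svm.h"   /* LibSVM's C interface [39] */

int main(void)
{
    /* load the trained SVM-model once, before any classification */
    struct svm_model *model = svm_load_model("pdf.model");
    if (model == NULL) {
        fprintf(stderr, "could not load model\n");
        return 1;
    }

    /* one PDF-feature vector: (index,value) pairs terminated by index -1;
       the indices and values below are placeholders */
    struct svm_node x[] = { {1, 132.0}, {2, 0.0}, {3, 1.0}, {-1, 0.0} };

    /* in the envisioned system this call would sit in a loop, accepting
       one feature file per detected PDF and classifying it */
    double label = svm_predict(model, x);
    printf("class: %g\n", label);

    svm_free_and_destroy_model(&model);
    return 0;
}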
Screen shots of the PDF-files with suspicious content can be found in appendix D. Since Kittilsen
scanned these files with different anti-virus solutions in 2011, the author decided to scan the
four PDF-files again with updated virus definitions (1/5-13). The author used the same three
anti-virus solutions Kittilsen used in 2011, with the addition of two new ones. The following
anti-virus solutions were used:
• MS Security Essentials [60]
• Trend Micro [61]
• AVG [62]
• Norton Internet Security [63]
• Norman [64]
The result of the scan is shown in table 11. The author was quite surprised by the result, but
understands that anti-virus software has to catch several other types of files and attacks, which
means some malicious files of a given file type will slip through the cracks. As the white paper
from Symantec explained [45], anti-virus solutions may, in order to avoid performance issues,
only scan the first four bytes when looking for a file-header. The author can, however, not state
with certainty that this is the case with these four PDF-files. The fact remains that the Adobe
specification does allow related or unrelated bytes in front of the PDF-header, as can
clearly be seen in figure 7 in section 4.1, where the author was able to open the PDF-file in a
PDF-reader despite the HTTP-header information.
The author notes that these four PDF-files were also uploaded to Jsunpack's website [65]. No
malicious PDF-files were detected, and screen shots of the results from Jsunpack's website
can be found in appendix E.
6 General Discussion
This chapter discusses theoretical implications and practical considerations, provides a conclusion,
and outlines possibilities for further work.
Figure 23: Brief documentation of how the author intended the detection system to work
          Kittilsen                                 Borg
Scope     Offline detection                         Online detection
Code      Python                                    C
Time      Worst case: size/complexity dependent     Worst case: size dependent
Focus     Detection accuracy                        Speed efficiency
Testing   Different machine learning algorithms     Storage mediums and string search algorithms
Table 12: The main differences between Kittilsen's master thesis [2] and the author's master thesis.
6.5 Conclusion
An online implementation looks feasible and possible to develop, but it can be quite hard to
implement perfectly with the current tools utilised in this thesis. One of the main issues that
makes the system difficult to implement is that there is no reliable way to know exactly
when every network packet of the PDF-file has been logged. The author tried to use '%%EOF' as
an indicator of reaching the end of the PDF-file, but problems occur when the PDF-file has
several '%%EOF' markers. One of them might even be placed within the first 1000 bytes of the
PDF-file, which means that the author's software may force Tcpflow to extract a PDF-file that is
still being logged by Snort. There are possibilities to circumvent this issue, but that requires
dividing the system into several subsystems/processes, each taking care of one specific
PDF-transfer event, for instance a user uploading a PDF-file to his local storage
space on a server, or a user downloading a PDF-file from the internet with a web browser.
This problem leads to the author's first research question: is an online PDF-file detection
system viable? The answer is that the detection system in its current form should not be
implemented in a real environment. There are too many faults present, including the limited
control Snort offers, the way a PDF-file may be constructed with an end-of-file marker almost
"anywhere" in the document, and the fact that buffering all of the traffic in a network is a bad
idea, as some of the traffic cannot be delayed. Countering these issues requires narrowing the
area of usage down to several specialised systems, and the analysis of such systems is outside
the scope of this master thesis. In regards to the time spent on analysing a file, one can look
at the results in table 9 in section 5.2.2. These results show that by using the Boyer-Moore-
Horspool algorithm one can achieve a time usage that is linear in the size of the PDF-file, where
the longest measured time was 6,7 seconds for the largest PDF-file of 171MB. For a file of 171MB,
spending only 6,7 seconds to extract all of the features on a single process thread is not bad.
While other scientific papers claim algorithms with a 10% efficiency increase [35], that would
only account for roughly 670 milliseconds of a total of 6,7 seconds.
The second research question asked whether there was any significant difference in time between
Jarle Kittilsen's feature extraction written in Python and the author's version written in C. There
was a significant difference in time usage, as can be seen in table 9 in section 5.2.2. However,
since Kittilsen's solution parses the PDF-document instead of performing a string search, complex
documents like master thesis reports require more processing time than simpler documents.
As described in section 5.2.2, the time spent parsing a document is not linear in the size of the
document. An example is the comparison between the time spent on a 171MB PDF-file
containing only images (51 seconds) versus a complex master thesis report of only 2,3MB (73
seconds).
The author believes it is quite difficult to develop a system that can analyse PDF-files and
prevent the files from reaching the user by only listening to network traffic. Other tools need to
be in place in order to make sure that all of the PDF-file's bytes are available for analysis. To be
able to add helpful tools, one will have to narrow down the system's scope and focus on
a specific way in which PDF-files are transported through the network, for instance
people downloading PDF-files through a web browser.
Bibliography
[1] Johnsrud, I., Ege, R. T., Henden, H., & Johnsen, A. B. Her er virus-e-posten som angrep
forsvaret. http://www.vg.no/nyheter/innenriks/artikkel.php?artid=10093859.
(Last visited 21/12-12).
[4] Laskov, P. & Šrndić, N. 2011. Static detection of malicious javascript-bearing pdf
documents. In Proceedings of the 27th Annual Computer Security Applications Conference,
ACSAC ’11, 373–382. ACM.
[5] Likarish, P., Jung, E., & Jo, I. 2009. Obfuscated malicious javascript detection using
classification techniques. In Malicious and Unwanted Software (MALWARE), 2009 4th
International Conference on, 47–54.
[7] Ben-Hur, A. & Weston, J. 2010. A user’s guide to support vector machines. In Data Mining
Techniques for the Life Sciences, Carugo, O. & Eisenhaber, F., eds, volume 609 of Methods in
Molecular Biology, 223–239. Humana Press.
[10] Symantec. 2012. Internet security threat report 2011 trends. http://www.symantec.com/
content/en/us/enterprise/other_resources/b-istr_main_report_2011_21239364.en-us.pdf.
(Last visited 21/12-12).
[13] Baccas, P. 2010. Finding rules for heuristic detection of malicious pdfs: With analysis of
embedded exploit code. (Last visited 21/12-12).
[18] Schmitt, F., Gassen, J., & Gerhards-Padilla, E. 2012. Pdf scrutinizer: Detecting javascript-
based attacks in pdf documents. In Privacy, Security and Trust (PST), 2012 Tenth Annual
International Conference on, 104–111.
[26] Chacos, B. 2012. How to supercharge your pc with a ram disk.
http://www.pcworld.com/article/260918/how_to_supercharge_your_pc_with_a_ram_disk.html.
(Last visited 1/5-13).
[30] Karp, R. M. & Rabin, M. 1987. Efficient randomized pattern-matching algorithms. IBM
Journal of Research and Development, 31(2), 249–260.
[31] Kustanto, C. & Liem, I. 2009. Automatic source code plagiarism detection. In Software
Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, 2009.
SNPD ’09. 10th ACIS International Conference on, 481–486.
[32] Horspool, R. N. 1980. Practical fast searching in strings. Software Practice and Experience,
10, 501–506.
[33] Xiong, Z. 2010. A composite boyer-moore algorithm for the string matching problem.
In Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010
International Conference on, 492–496.
[34] Knuth, D. E., Morris, J. H., & Pratt, V. R. 1974. Fast pattern matching in strings.
[35] Sunday, D. M. August 1990. A very fast substring search algorithm. Commun. ACM, 33(8),
132–142.
[39] Chang, C.-C. & Lin, C.-J. Libsvm – a library for support vector machines.
www.csie.ntu.edu.tw/~cjlin/libsvm/. (Last visited 3/5-13).
[58] Gprof tutorial – how to use linux gnu gcc profiling tool.
http://www.thegeekstuff.com/2012/08/gprof-tutorial/l. (Last visited 20/5-13).
[59] Adobe supplement to the ISO 32000, BaseVersion: 1.7, ExtensionLevel: 3.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf.
(Last visited 4/5-13).
Data results for the naive algorithm: [Feature name]: '[Feature count]','[If()-count]','[Time in ms]'

//---------- If()-sentences counted, disregard the time in this section ----------
Filename: jarle.pdf
/Page: '132','2303711','7'         /AcroForm: '0','2303326','7'
/OpenAction: '1','2302567','7'     /AA: '0','2303326','7'
/JS: '0','2302544','7'             /JavaScript: '0','2302544','7'
/RichMedia: '0','2302986','7'      /Launch: '0','2303497','7'
startxref: '1','6804369','20'      trailer: '1','6805628','21'
obj/end: '2719','2719'|'0'|,'6823848','6827764'|'20','21',
Total if_c:45686110, Total time feature(ms):138, Size:2263642

Filename: pdf_reference_1-7.pdf
/Page: '1310','33798997','100'     /AcroForm: '1','33854245','98'
/OpenAction: '0','33789632','100'  /AA: '17','33854643','100'
/JS: '0','33787440','100'          /JavaScript: '0','33787444','100'
/RichMedia: '0','33822249','100'   /Launch: '0','33825607','100'
startxref: '2','97548474','301'    trailer: '7','97805684','296'
obj/end: '110776','110776'|'0'|,'98752653','98686249'|'303','303',
Total if_c:663313317, Total time feature(ms):2001, Size:32472771

Filename: ScientificMethodology.pdf
/Page: '299','172774128','505'     /AcroForm: '0','172773931','503'
/OpenAction: '0','172772782','506' /AA: '58','172774022','508'
/JS: '1','172771412','518'         /JavaScript: '0','172771415','534'
/RichMedia: '0','172773203','512'  /Launch: '0','172772649','506'
startxref: '2','514704628','1546'  trailer: '7','514770616','1556'
obj/end: '110776','110776'|'0'|,'515103548','514676773'|'1560','1554',
Total if_c:3441439107, Total time feature(ms):10308, Size:171372579

Filename: adobe_supplement_iso32000.pdf
/Page: '140','1429782','4'         /AcroForm: '0','1429508','4'
/OpenAction: '2','1428100','5'     /AA: '0','1429516','4'
/JS: '0','1428049','4'             /JavaScript: '0','1428050','5'
/RichMedia: '0','1429521','4'      /Launch: '0','1429654','4'
startxref: '4','4129906','12'      trailer: '4','4135991','12'
obj/end: '5573','5573'|'0'|,'4185523','4184209'|'13','13',
Total if_c:28067809, Total time feature(ms):84, Size:1373256

Filename: PDF32000_2008.pdf
/Page: '756','9218587','28'        /AcroForm: '0','9218579','28'
/OpenAction: '1','9215120','28'    /AA: '17','9218945','27'
/JS: '0','9213927','30'            /JavaScript: '0','9213928','28'
/RichMedia: '0','9215423','28'     /Launch: '0','9216686','27'
startxref: '2','27055651','82'     trailer: '0','27074868','82'
obj/end: '3325','3325'|'0'|,'27078928','27111491'|'82','82',
Total if_c:182052133, Total time feature(ms):552, Size:8995189