
Journal of Information Security and Applications 65 (2022) 103096

Contents lists available at ScienceDirect

Journal of Information Security and Applications


journal homepage: www.elsevier.com/locate/jisa

Carving of the OOXML document from volatile memory using unsupervised learning techniques
Noor Ul Ain Ali, Waseem Iqbal ∗, Hammad Afzal
National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan

ARTICLE INFO

Keywords: Digital forensics; File carving; Open Office XML file format; Clustering

ABSTRACT

With the drastic rise in digital crimes, increasing emphasis is laid on digital forensics to solve crimes. Evidence extracted from digital devices is now considered sufficient to prove a criminal guilty in the courtroom. While gathering proof from a system, a significant venue to examine is the primary memory (commonly known as RAM), which holds information about the current state of the system. It often occurs that the metadata of a file is deleted or unavailable to the digital investigator. In such cases, retrieving the file from RAM becomes very challenging, as data is placed at random locations in RAM and the allocation tables are no longer valid once the power supply to the system is removed. When carving these instances, scattered randomly across RAM, their usefulness in the legal process cannot be ensured. The Microsoft Open Office XML file format (OOXML FF) is one of the most widely used formats, yet it is little explored in forensics (and carving). Our research improves the technique of carving an OOXML file: we employ clustering to collect chunks of the same data based on a similarity feature. Numerous OOXML files are used in our experiments, where we extracted and rearranged their textual contents using clustering techniques, i.e., K-means and hierarchical clustering. The results are quite encouraging and show that our proposed method can be used for carving the OOXML format. Our technique for extracting OOXML documents from RAM reduces hassle and saves ample time for the digital investigator, who would otherwise have to go through every document available on the system to find the concerned document.

1. Introduction

The information required to extract the complete data from a system is generally present in the metadata of a file. However, there are cases in which the metadata of a file is lost or corrupted, so files are extracted using a special technique known as carving [1], [2]. In digital forensics, file carving is defined as retrieving a file from a system with no knowledge of the file system and no metadata information about the file [3]. File carving is performed to recover files from the unallocated space present in a drive. Unallocated space means the parts of the drive that no longer hold any useful information regarding the file. In other words, file systems do not completely remove data when it is deleted; instead, they simply remove the knowledge of its location on the drive. Hence, file carving involves scanning the raw bytes of data on the disk and reassembling them to reconstruct the lost file. This is usually done by finding and examining the first few bytes of a file, known as the header, and the last few bytes, known as the footer.

Each type of file format has a specific signature, i.e., some particular bytes that are fixed, and the file system can determine the type of a file using such information. These signatures are readily available and can be matched against the ones present on a disk or drive to determine what files are present in the system. Carving plays its role when these signatures cannot be determined and raw bytes must be examined to determine the type of file. Carving would be very easy if a file were stored in consecutive parts of memory, whereas in real-life scenarios a single file gets stored at multiple locations placed randomly across the whole memory structure [4]. Hence, carving becomes a very tedious task when it comes to the recovery of fragmented files.

File formats such as JPEG, PNG and PPT all have fixed structures. During carving, these structures can be used to determine the signatures, headers, or footers [5]. Among the readily available file formats, recovering a lost Microsoft Word document is very complex, because its Open Office XML file format is a series of compressed files and folders. This arrangement into multiple files and folders gives the user more control over the document [6], but at the same time it makes the internal structure of the OOXML file format quite complex and variable.

In order to resolve the issue of carving the OOXML Word document, we first studied the structure of the OOXML and determined its header

∗ Corresponding author.
E-mail address: [email protected] (W. Iqbal).

https://doi.org/10.1016/j.jisa.2021.103096

Available online 15 January 2022


2214-2126/© 2021 Elsevier Ltd. All rights reserved.

along with other important components. We further extracted these components from the raw memory and extracted the text from them. Finally, using clustering techniques, we assigned the retrieved texts to their respective files and thus reassembled the documents. The major research contributions of this paper are:

• The paper presents a novel technique for carving Microsoft Word documents from a raw dump of RAM.
• The proposed method uses various feature extraction methods along with unsupervised learning techniques (K-means, hierarchical and mean shift clustering) to identify the scattered parts of documents and reassemble them.
• The technique is applied to raw bytes extracted from a RAM dump, which are decoded; XML components are identified and then used for carving scattered pieces of text belonging to the same files.

The remainder of the paper is organized as follows: Section 2 provides the background; Section 3 provides the related work; Section 4 provides the details of the proposed carving method; Section 5 shows the experiments designed and their results; Section 6 provides the analysis of the experiments; and finally Section 7 provides the conclusion and future work.

2. Preliminary concepts

In simple terms, carving means extracting data from memory. The initial file carvers utilized just the header and footer of a file: they identified the unique signature using the header/footer and extracted whatever was present between them. This method worked well for smaller files and files with a simple structure. But when it came to complex structures like the Microsoft Open Office XML file format, simple extraction did not work. The parts of the file had to be extracted first, and then, in order to put them together, clustering of similar parts needed to be done.

This section briefly describes clustering and its types, the steps involved in pre-processing the data so that we can cluster it effectively, and finally the structure of the Open Office XML file format.

Clustering is a process of collecting similar objects into groups; each such organization of similar objects into a single entity is called a cluster.

2.1. Clustering and its types

Clustering is a technique in which we classify different data points into groups that exhibit similar characteristics. It is a form of unsupervised machine learning. The most well-known types of clustering are:

• K-means clustering (partitioning): the number of clusters must be determined initially; it is a very fast algorithm.
• Mean shift clustering: it is centroid based and tries to find the center of each cluster.
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN): it starts from a random starting point, and neighbors are determined using an epsilon distance. It can identify noise as well.
• Expectation Maximization (EM) clustering using Gaussian Mixture Models (GMM): the points are assumed to follow a Gaussian distribution, and we need both the mean and the standard deviation to find the clusters.
• Agglomerative hierarchical clustering: there is no need to specify the number of clusters initially. This algorithm is also insensitive to the choice of distance metric.

In our research we have used K-means and hierarchical clustering. In partitioning (K-means) the number of clusters needs to be specified initially, whereas in hierarchical clustering it does not.

2.2. OOXML file format

The Open Office XML file format is used for the newer versions of Microsoft Word documents. It consists of zipped files, folders and subfolders. If you change the extension of a .docx file to .zip, you will observe a series of compressed XML folders, shown in Fig. 1. If you open these files and folders you will find the characteristics of a Word file. For example, starting with the [Content_Types].xml file, it tells us what type of content will be present in the document, such as Word, PowerPoint or Excel data. In particular, look for the ⟨Override PartName=…⟩ tags, where you will find word, ppt, or xls, respectively [7]. Similarly, looking at the other folders, the _rels folder contains a file known as the .rels file, which records the relationships between every part of the document. Then there is the docProps folder, which contains two files, app.xml and core.xml, holding the properties of the file. Finally there is a word folder containing multiple files that determine the structure of the Word file. Inside the word folder is our document.xml file; the content of the whole document is present in this file. The remaining components, such as styles, table and theme, just deal with the appearance of the document. Fig. 1 shows the hierarchical contents of an OOXML file [8].

2.3. Importance of OOXML file forensics

OOXML files contain a lot of information that can be extremely helpful in a digital forensics case. Some hypothetical scenarios that use OOXML identifiers to solve important cases are described below.

2.3.1. Creator of the document

docProps/core.xml has a creator identifier that stores the name of the operating system on which the document was originally created, and it also includes the paths of image sources. This can be used to track and identify the original creator of a document [9], [10], [11].

Scenario: A bomb threat has been sent to a company via email as a PDF, with a picture of a map attached.

Forensics investigation: Investigators tracked down the IP address of the email and narrowed the location to a few houses. The houses were raided and the electronic devices were confiscated. Upon going through the RAM of all the devices, they found that a .docx document had recently been deleted on one of the suspected computers. The document was extracted, the packages were retrieved, and the path of the image was found on one of the computers, strongly indicating that this suspect's computer was involved.

2.3.2. Thumbnail

When a Word document is saved, a thumbnail for it is saved by default. If the original document was destroyed but its thumbnail is still intact, the thumbnail can give us information about the contents of that document. Moreover, if the actual contents of a document differ from the thumbnail associated with it, then either the contents of the document or the thumbnail have been altered; this can be used to check the authenticity of a Word document.

Scenario: A major drug dealer was tipped off by someone; his place was raided and all his digital devices were taken by the forensics team.

Forensics investigation: The investigators start by looking at all the available data. They take a RAM dump and find in it a file named debt.docx, but that file is partly overwritten and does not open. The investigators then find a thumbnail of that file and are able to retrieve the first two pages, which contain the names, contact information and account details of potential customers.
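As a concrete, minimal sketch of the two clustering methods adopted in Section 2.1 (K-means and agglomerative hierarchical clustering), the snippet below groups a few toy text fragments with scikit-learn. The fragments, features and parameters are illustrative assumptions, not the paper's dataset or exact pipeline.

```python
# Illustrative sketch: grouping short text fragments with the two
# clustering families used in this paper (toy data, not the authors' corpus).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy fragments standing in for text recovered from two different documents.
fragments = [
    "invoice payment due account balance",
    "payment account invoice total due",
    "malware forensic memory dump analysis",
    "forensic analysis of a memory dump image",
]

# Vectorize the fragments so similarity can be measured numerically.
X = TfidfVectorizer().fit_transform(fragments).toarray()

# K-means requires the number of clusters up front.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agglomerative (hierarchical) clustering can instead cut the dendrogram
# at a distance threshold, without fixing the number of clusters in advance.
hc = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
hc_labels = hc.fit_predict(X)

print(km_labels, hc_labels)
```

Note how K-means needs `n_clusters` while the hierarchical variant does not, which is exactly the distinction drawn in Section 2.1.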


Fig. 1. Structure of an OOXML File.
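The container layout in Fig. 1 can also be inspected programmatically. The sketch below builds a tiny stand-in .docx in memory (a real document has many more parts), lists its entries, and reads the creator property from docProps/core.xml, the identifier exploited in the scenarios of Section 2.3; all file contents here are simplified placeholders.

```python
import io
import re
import zipfile

# Build a tiny stand-in .docx in memory; a real document has many more
# parts, but the container is the same: an ordinary ZIP archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("_rels/.rels", "<Relationships/>")
    z.writestr("docProps/core.xml",
               "<cp:coreProperties><dc:creator>alice</dc:creator></cp:coreProperties>")
    z.writestr("word/document.xml", "<w:document/>")

# List the parts (cf. Fig. 1) and pull the forensically useful creator field.
with zipfile.ZipFile(buf) as docx:
    parts = docx.namelist()
    core = docx.read("docProps/core.xml").decode("utf-8")
    creator = re.search(r"<dc:creator>(.*?)</dc:creator>", core).group(1)

print(parts)
print("creator:", creator)
```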

2.3.3. Timestamps

docProps/core.xml also has a Total Time identifier that stores the recorded editing time of the document. If the recorded time has a low value while the contents of the document are large, this shows that many parts of the document were made elsewhere, indicating a possible theft of intellectual property.

One of the main aspects of a digital investigation is to establish at what time a certain incident took place. By inspecting the digital evidence and the physical occurrences of an investigation, an investigator can build a timeline of events that can be helpful to the investigation. For this purpose the timestamps of the OOXML document come in handy: docProps/core.xml has a modified identifier that records the timestamp of the last modification made to the document.

Scenario: A man was the prime suspect in a murder, but he claimed he was innocent: he was writing a letter to a friend at the time of the homicide and could not have been present at the murder location.

Forensics investigation: The investigators took hold of his digital devices and located the said document. They then examined the total time identifier and the timestamps associated with that particular document and found that the suspect was indeed modifying the document at the time of the murder and could not have been present at the murder location. Hence the man was proved not guilty.

2.3.4. Revision identifiers

OOXML contains identifiers that record the number of revisions made to a document. One such entity is the rsidRoot entity, found in word/settings.xml. The purpose of this entity is to tell which documents belong to the same root document, i.e., are revisions of the same document or were created simultaneously. If the metadata of a file is destroyed, then by comparing rsidRoot values we can find which two documents were made using the same machine [12].

Scenario: Employees of a company are under investigation for leaking sensitive information to a rival company. A document was found in the printer cache, and the investigators found that the metadata of the document had been intentionally deleted.

Forensics investigation: The investigators took hold of all the machines in the office and extracted the OOXML documents from them. They then examined the rsidRoot values of all the OOXML documents and compared them with the rsidRoot value of the document found in the print cache. The value of one document matched the rsidRoot of the document from the printer, proving that this document was made on the same machine or simultaneously. The owner of the machine was questioned and finally admitted his misconduct.

The following are some real-life examples that demonstrate the importance of OOXML forensics.

Electronic document evidence can support a case and prove to be legitimate evidence in the courtroom. The case of Dennis Rader (2005) is one such example [13]. He was on the loose for thirty years after committing 10 murders over that period, and he left no clue for the police. He used to send notes to the police using the victims' driver licenses and IDs, but the police could not trace him, until 30 years later he decided to send a letter to the police on a floppy disk. The police forensically examined the Word document it contained and found that the last-modified author was Dennis; it traced to the Lutheran Church, where Dennis was a deacon. This evidence put Dennis behind bars after 30 years.

There are also cases where one needs to track the source of a document. One such case is the 22nd July 2011 case in Oslo [14], in which the terrorists circulated an attack manual. The forensic investigators performed analysis on that document and found important information such as metadata. They also used the revision identifiers to check how many times the document had been edited and found that it had been edited over a period of almost 4 years. They further found the different sources on which the document had been modified, traced one source, and solved the case.

Also, with the increase in digital documentation and the decrease in the cost of digital storage, it has been estimated that 93 percent of documentation is now done electronically. Trying to retrieve information from this digital archive for any legal proceeding can therefore be very time consuming and daunting. By using our technique, the chances of finding potentially relevant material quickly are much higher.

3. Related work

The basic technique of carving utilized the presence of special characters known as magic numbers, or the header and footer. These characters were identified and extracted in order to retrieve a file, but because of the presence of complex formats this technique became obsolete. Based on the type of memory on which the carving is performed, we can divide the related work as follows:


3.1. Volatile memory

The following performed their research on volatile memory. Sara et al. [15] proposed that extraction of relevant data is difficult when it comes to embedded files like Microsoft Word documents, so they performed a statistical analysis on the data and extracted unique values for each data type. Since these data values are unique, they help identify the fragments in a better way [16].

Kulesh et al. [17] performed extraction of a collection of fragments. They used data compression models for classifying the collection of fragments.

Bora et al. [18] showed that parts of Word documents can be customized and can also be used to add malicious content.

Garfinkel [19] introduced the concept of validation of the recovered objects. His work focused on the Portable Document Format (PDF). He designed a component known as a validator, whose purpose was to make sure that decoded data had been decoded correctly.

Dino et al. [20] introduced the technique of hybrid classification. They first used the Naive Bayes algorithm to vectorize a document and then performed SVM classification of the document. Their approach shows a reduction in training time as compared to other techniques.

Rasjid et al. [21] used the information retrieval process to perform classification and clustering; the data used is a huge corpus of audio, video, documents, etc., and the data classification is done using kNN and Naive Bayes techniques.

3.2. Non-volatile memory

Non-volatile memory was explored by the following. Zaid et al. [22] used the XML structure to find out whether a particular file was being viewed or not. After this, Hyukdon et al. [23] developed a tool that detected data hidden in the empty places of a Word document; there are a lot of empty or slack spaces present in a Word document. This tool again only worked on the .doc file format.

Simpson et al. [24] performed carving of JPEG and ZIP files using a fast validation technique.

Binglong et al. [25] identified only the header fragments of Word documents. They used byte-level patterns to identify the header of a Word document: the header had a specific pattern of bytes across every document and hence could be identified.

Mehdi et al. [26] used the frequency distribution of bytes to identify the type of a fragment, since similar types of data generate similar frequency distributions.

Cohen [27] suggested that the carving of fragmented components is equivalent to a mapping function between the bytes of the recovered file and the bytes from the memory. He proposed that files can be recovered using a generator that can produce all the mapping functions that can exist. The only disadvantage is the huge number of mapping functions to be generated.

3.3. Others

Other types of carving were explored as follows. Wei et al. [28] carved the Word document using the control streams available in a Word document. For this they first extracted the header and found all the control streams in the document. Based on this information they found the fragmentation points of the document, and from there they used the internal structure of the Word document to find the fragments. However, their technique could not be applied to the OOXML file format.

Golden et al. [29] gave a new concept of carving, known as live carving, that saved a lot of memory. Instead of copying the whole disk and keeping a record, they devised a way to perform live forensics by using a secondary operating system.

Luigi et al. [30] started by identifying chunks of data and grouped similar chunks together using machine learning classifiers. Finally, Aaron et al. developed a tool for volatile memory analysis. Based on the literature review carried out, we can deduce that the carving of OOXML file format documents is a very little-researched matter.

Yong et al. [31] classified web documents using the Naive Bayes theorem. They first constructed the training data using the WebDoc system and then used this information to classify the web documents. They used different probabilities, features and event models to support their technique.

Diriksen et al. [12] give a very detailed overview of the OOXML file structure; they also highlight the weaknesses in the structure of a Word document and point out some forensically important facts.

Fu et al. [32] explore the hidden data in the Word document and find potential hiding spaces in the Word document structure. Cantrell et al. [33] propose the detection of covert messages in Word documents and present a solution to detect these hidden messages.

A very detailed literature review has also been carried out by us; the paper is mentioned in the bibliography [34].

Table 1 shows the type of memory explored, the type of artifacts retrieved, and a brief summary of the technique used for carving.

4. Proposed methodology for carving of OOXML document from the memory

When a Word file is viewed or edited, the file gets stored in RAM and stays there until it is overwritten. We assume that the metadata of the Word file is lost or destroyed and that the file has also been removed from the system [35]. To retrieve this file we use the RAM contents, so we start with a RAM dump; in our case the dump is taken with the DumpIt tool. We then analyze this RAM dump further. From our previous knowledge we know that a Word file consists of many components, so we have to perform these three tasks:

(1) Extract the Open Office Microsoft Word document components.
(2) Extract the textual content correctly from these components.
(3) Classify and group these textual contents correctly in order to find which content belongs to which Word file.

4.1. Collection of text files

For our experimentation we used a virtual Windows 7 Operating System (OS) created with the VMware Workstation application, with 1 GB of RAM for our analysis. The first thing we need to do is find the text present in the Word documents. For this purpose we perform the following steps.

4.1.1. Creation of datasets

The dataset was created by ourselves by downloading different Word documents from the internet. Care was taken to use documents that have different themes so that the classification process is easier. The details of the datasets used are given in the experiments section below.

4.1.2. Finding the RAM dump

There are a number of open source tools available for creating memory dumps. The tool we used for memory dumping was DumpIt, a command-prompt application that takes a live dump of your RAM and saves it in a raw binary format (.raw). We also use the WinHex hex editor, which helps us analyze the RAM dump at the byte level.


Table 1
Summary of literature review (paper; memory type; artifact; pros; limitation).

• Zaid et al. (Non-volatile; MS Word document). Pros: their approach used the XML representation of a Word document; they created a RAM dump and decompressed it to analyze and find some parts of the Word document. Limitation: they could only identify the number of paragraphs and tell whether a file was being viewed/edited or not.
• Wei et al. (Hard drive; MS Word document). Pros: they used the virtual streams of a Word document; they first identify the header and SAT and reconstruct the Word document using this information along with the virtual stream information. Limitation: their technique only works for the .doc file format.
• Binglong et al. (Files downloaded from the internet; header fragments of multiple file types). Pros: they identified the header fragments of a number of files using a Support Vector Machine (SVM) classifier. Limitation: their technique only identifies a file fragment that contains header information.
• Mehdi et al. (Non-volatile; fragments of multiple file types). Pros: they identified different fragments using a feature known as byte frequency distribution (BFD) with an unsupervised neural network. Limitation: it can only identify the type of a fragment.
• Hyukdon et al. (Non-volatile; hidden data in compound file format). Pros: they developed a tool for detecting malicious data in .doc files by identifying the unused spaces found in the structure of the Compound File Format. Limitation: finds only malicious data in the .doc format.
• Golden et al. (Local and remote drives; image files). Pros: they proposed a carving technique that reduced the amount of storage required for carving, introducing the idea of live carving of files without copying the whole contents of the drives. Limitation: the proposed mechanism works well for images only.
• Luigi et al. (Files downloaded from the internet; all types of files). Pros: the idea behind their carving is to first identify each chunk of data, allocate it a specific file type, and then combine blocks of the same type using classifiers. Limitation: extra processing required.
• Aaron et al. (Volatile memory; live responses). Pros: they developed a tool for exploring RAM and laid emphasis on the importance of RAM in the field of digital forensics. Limitation: tool development only.
• Sara et al. (Volatile memory; all embedded files). Pros: they performed statistical analysis of data and generated unique values for each data type. Limitation: they need to process every byte of the document, which takes a lot of processing.
• Kulesh et al. (Volatile memory; Word files). Pros: they used data compression to perform statistical modeling to reassemble the fragments of a Word document. Limitation: they need an optimal ordering of the data for performing the analysis.
• Bora et al. (Volatile memory; Word files). Pros: they proposed that data can be hidden in an OOXML Word document by defining custom parts of the Word file. Limitation: this technique is limited to OOXML files only.
• Simpson et al. (Non-volatile memory; JPEG and ZIP files). Pros: they proposed that carving is a multi-tier problem that quickly validates or discards byte-stream data. Limitation: only valid for a very few file formats.
• Garfinkel et al. (Volatile memory; PDF and ZIP). Pros: he introduced the concept of a validator so that incorrect data can be eliminated during decoding. Limitation: it cannot detect the semantic errors present in the data.
• Cohen et al. (Non-volatile memory; all files). Pros: he proposed that files can be recovered by designing a generator that can produce all possible mapping functions for the recovered files. Limitation: a lot of processing power is required.
• Yong et al. (Others; web documents only). Pros: they used the Naive Bayes algorithm for training their technique. Limitation: only suitable for web documents.
• Dino et al. (Volatile; all types of documents). Pros: they used the Naive Bayes algorithm and the SVM technique to develop a hybrid technique. Limitation: computationally intensive, since it requires two algorithms.
• Rasjid et al. (Volatile; all types of data). Pros: they used the kNN and Naive Bayes algorithms to classify different types of data. Limitation: the data is processed using online available software.

4.1.3. Finding header These first four components will always exist in the same sequence
Since the first component of every file is a header. We start our whereas the sequence of the later components can change according
analysis by manually finding the header of the Word file using WinHex.
Since DOCX is part of the Microsoft Office Open XML (OOXML) format, a Word document is in essence a series of compressed ZIP entries. That is, if we change the extension of any Word document from .docx to .zip, the resultant archive begins with the entry [Content_Types].xml, which lists the content types. This [Content_Types].xml, our header, looks like this: 504B030414000600 followed by 18 additional bytes. The first two bytes of the header, 0x50 0x4B, are also the signature of a ZIP file.

4.1.4. Finding XML components
After finding the header, i.e. [Content_Types].xml, we find the next component, which is the _rels file. In WinHex we search the raw bytes of the RAM dump and look for the tag _rels/.rels. When we find this tag we select those bytes, go back to where we find the 00 00 00 bytes, select the content between the zeros and the rels tag, and save it as a new file. Since we know that the OOXML content is ZIP-compressed, we will later decode this part. A screenshot is shared in Fig. 2. After finding the rels entry we find word/Document.xml.rels and Document.xml consecutively, followed by theme1; after this come the settings, App.xml, core.xml, websettings, Fonttable and styles.xml. The order of these later components corresponds to the setting of the Word document. After observing a number of Microsoft Word documents we found a pattern in the way the files and folders are aligned in the hex file. The sequence of the components (reading the first row from left to right, the second row from right to left, and the third row from left to right again) is shown in Fig. 3.

The first five components are constant for any Word file and appear in the same sequence, but the later ones appear according to their use in the Word file. For example, if a font is set first, then Fonttable.xml will appear first, followed by the rest of the components.

As we know, the Open Office XML representation uses XML as its back end, and an MS Word file is just a compressed folder of different XML files, each of which represents a specific feature of the Word document [8]. We observed that the folder or file that appears first in the hex file also ends first in the file. For example, if [Content_Types].xml comes first in the compressed format, then when the file begins to end, the first entry in the ending section will also be [Content_Types].xml. We can consider this like an XML wrapping.
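As an illustration of the header search described above, a minimal sketch that scans a raw buffer for the signature bytes might look like the following; the sample buffer and helper name are assumptions for illustration, not the paper's tool:

```python
# Illustrative sketch of the header search described above: scan a raw
# memory dump for the OOXML header signature. The sample buffer below is
# a stand-in assumption, not a real RAM dump.
SIGNATURE = bytes.fromhex("504B030414000600")

def find_headers(data: bytes) -> list:
    """Return every offset at which the OOXML/ZIP header signature occurs."""
    offsets = []
    pos = data.find(SIGNATURE)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(SIGNATURE, pos + 1)
    return offsets

# A toy "dump": zero padding, then a header followed by the first entry name.
sample = b"\x00" * 16 + SIGNATURE + b"[Content_Types].xml" + b"\x00" * 8
print(find_headers(sample))  # [16]
```

Each reported offset becomes a candidate starting point from which the subsequent components can be searched.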

N.U.A. Ali et al. Journal of Information Security and Applications 65 (2022) 103096

Fig. 2. Screenshot of RAM dump.

Fig. 3. Components of OOXML file format.

4.1.5. Extraction of textual content
Since we have recovered the different components of the Word document, we analyze them further to find the textual data of the document. We first observe the [Content_Types].xml data; we know from the basic structure of Open Office XML that it contains information about all the parts of the document but does not contain the actual content. The Open Office XML internal structure states that the content of the Word document is present in the document.xml file, which is part of the word folder. This component is ZIP encoded, so the first thing we need to do is decode it using a ZIP decoder. After the decoding is done we further examine all the document.xml files. The tag <w:t> holds the textual content of the Word file. A <w:t> tag is a run, and it is present inside a <w:p> tag, i.e. a paragraph. From this we can note down the text and the number of paragraphs for each Word document.

4.1.6. Correction of mistakes
The plain text that comes from decoding can have a number of spelling and spacing mistakes, probably because not every character is decoded properly. Some characters are missed, hence spacing and spelling issues can occur. For now we simply correct the spaces and spelling mistakes that occur during the extraction of textual content. For this we pasted the content into a Word document, which gave us suggestions for the corrections. We rectified the mistakes and saved these textual instances for further processing.
For example, an encoded sentence ?This is an experimental test document used for forensics analysis of RAM? can be decoded as follows: This is an experimental test document used for forensics analysis of RAM. Some of the characters stay undecoded and disrupt the whole sentence; with simple observation we can correct the spellings and spacing of the lines.

4.1.7. Saving correct files to a pool of text files
Once we have the correct data we save it as a text file. Since we gather multiple text files, we simply make a pool of text files for further processing.
Up till now our first two tasks are done. Now we want to correctly identify which text belongs to which file. For this purpose we have to perform machine learning classification, but before that we have to perform preprocessing on our data so that it is readable by the computer.

4.2. Pre-processing

Computers cannot comprehend the English language and machine learning algorithms cannot work with raw text. In order to perform both the K-means algorithm and the hierarchical algorithm we need to perform some preprocessing on our data. This preprocessing is basically making our textual data ready for input to the machine learning algorithm. Before doing that we reduce the redundancy in the data, so that the size of our data shrinks and we use less processing power. Following are the preprocessing techniques used in our approach:

4.2.1. Tokenizing
We first collect all the text files and then tokenize each word document. This means that we separate every word of the document so that each word appears as a single entity. If we tokenize our document corpus above we get the following result.
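The text-extraction step described in Section 4.1.5 can be sketched as follows. The raw deflate stream is simulated here, and the regex-based handling of the <w:t> tags is a simplification we assume for brevity; it is not the paper's tool:

```python
# Sketch of the extraction step above: inflate a (simulated) carved
# document.xml deflate stream and collect the text inside the <w:t> runs.
import re
import zlib

def inflate_entry(raw: bytes) -> str:
    # ZIP entries store raw deflate data, hence the negative wbits.
    return zlib.decompress(raw, wbits=-15).decode("utf-8", errors="replace")

def extract_runs(document_xml: str) -> list:
    # All textual content of the document lives inside <w:t> run tags.
    return re.findall(r"<w:t[^>]*>(.*?)</w:t>", document_xml, flags=re.S)

# Simulate a carved entry: compress some WordprocessingML, then strip the
# 2-byte zlib header and 4-byte checksum to mimic a raw deflate stream.
xml = "<w:p><w:r><w:t>This is an experimental test</w:t></w:r></w:p>"
raw = zlib.compress(xml.encode())[2:-4]
print(extract_runs(inflate_entry(raw)))  # ['This is an experimental test']
```

Counting the enclosing <w:p> tags in the same way would give the number of paragraphs per document.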


4.2.2. Stopwords
Stopwords are words that are very common in text and do not carry any significant meaning, such as "are" and "is". These words do not contribute towards the semantic analysis or classification of the text, so we remove these stop words to save computing power.

4.3. Feature extraction

For feature extraction we extract the features on which we will perform the classification. For the extraction of these features we have used the Bag of Words (BoW) model. This model is specifically used for the extraction of features from text. Bag of Words is a grouping of the occurrences of words in a text document: the sequence in which the words occur does not matter, but the frequency of the words is important. The basic idea is that similar documents have similar word frequencies, so the features are basically the word counts [36].

We first collect all the text files and then tokenize each text document, meaning that we make each word an individual component. Then we define our vocabulary and assign "0" or "1" to every word of the text file based on its presence in the vocabulary. This produces huge vectors with many zeros, known as sparse vectors, so we reduce the vectors using techniques such as TF–IDF. TF–IDF assigns a weight to every word, and we use these weights to then perform the machine learning algorithms.

This is the first step for every machine learning technique. To reduce the size of our dataset, we convert each word into a vector and remove the common words so that the processing time is reduced. There are numerous models available for preprocessing, but we use the Bag of Words model.

Following are the steps involved in the Bag of Words model:
The first step is the collection of our data. The data can be anything from emails to letters to documents, anything that contains textual elements. From these documents we form a corpus. Following is an example of the corpus we have worked on. It is important to notice that every line of the corpus is basically a different text document.

The next step is to build the vocabulary. The vocabulary can be defined yourself, or one can download a pre-defined vocabulary. The purpose is to have a collection of unique words from the documents. While defining the vocabulary, care needs to be taken because the vocabulary has a direct impact on the result of classification.

Then comes the creation of vectors for every document. One simple way is to use the Boolean representation, i.e. "1" if a word is present in the vocabulary and "0" if it is absent. The vector representation for the above documents is as follows:

In this way the frequencies of occurrence of the known words are documented. This is just a simple example; in reality the vectors will be very long, so what we do next is reduce the size of these vectors.
For reducing our vocabulary we ignore common words in the documents, including articles and words such as "is" and "am". We also ignore words that contain spelling mistakes. A more refined method is to store two or three words as a single entity; this approach is called the "n-gram approach" [37]. Once the vocabulary is reduced, the next step is to store the occurrence of these words in each document. One way is to use the Boolean representation, i.e. "1" and "0"; other methods count the number of times a particular word appears in the document.
When we score the words according to their occurrence, the words with the highest frequencies will be words that have no significant meaning, such as "when" and "there". To favour the unique words instead we use the term frequency–inverse document frequency (TF–IDF) approach. Term frequency is the frequency with which a word appears in a text document, and inverse document frequency measures how rare a word is across all documents. Basically, the IDF of a common word is low whereas the IDF of a unique word is high [38].

4.4. Clustering

After preprocessing we perform the K-means clustering and hierarchical clustering.
K-means clustering is used to find the groups, or clusters, in data. Entities of one group are entirely different from those of another group. Initially we have to define the number of clusters ourselves, therefore we have to perform a number of iterations to get our required result. We start by partitioning our objects into K groups; it is important that each group is non-empty. For each partition we created, we define a centroid. Then we assign objects to the clusters: using the distance formula we calculate the distance of each object to every centroid and assign the object to the cluster whose object–centroid distance is the smallest. We then re-allot the clusters and again calculate the centroid of each newly assigned cluster. We keep re-allocating until we cannot define any more new clusters. This then gives us our actual number of clusters.
In hierarchical clustering, initially each point is a cluster on its own. At each step we find the two closest clusters and combine them into a single cluster. A cluster of more than one point is represented by its centroid (which is equal to the average of its points). To determine the nearness of clusters we measure the cluster distance by the distance between centroids. To stop combining clusters we either pick a target number of clusters or stop when combining would create a bad cluster, i.e. one with a very large distance between centroids.
To evaluate the performance of these techniques we have used precision and recall. The details regarding the experiments performed are present in the next section. The detailed diagram of all the steps of our methodology is shown in Fig. 4.
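The vectorization and clustering steps described above can be sketched in a few lines. The three-line corpus, the vocabulary handling, and the two initial centroids below are illustrative assumptions, not the paper's dataset or implementation:

```python
# Minimal sketch of the Bag-of-Words / TF-IDF weighting and one K-means
# assignment step described above.
import math
from collections import Counter

corpus = [
    "forensic analysis of volatile memory",
    "carving word documents from memory",
    "memory dump analysis",
]
docs = [line.split() for line in corpus]            # tokenizing
vocab = sorted({w for doc in docs for w in doc})    # vocabulary

def tf_idf(doc):
    counts = Counter(doc)
    vec = []
    for term in vocab:
        tf = counts[term] / len(doc)                # term frequency
        df = sum(term in d for d in docs)           # document frequency
        vec.append(tf * math.log(len(docs) / df))   # common word -> IDF 0
    return vec

vectors = [tf_idf(d) for d in docs]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# One K-means assignment step with two assumed initial centroids: each
# document joins the cluster whose centroid is nearest.
centroids = [vectors[0], vectors[2]]
labels = [min(range(2), key=lambda k: dist(v, centroids[k])) for v in vectors]
print(labels)
```

Note that "memory", which occurs in every document of this toy corpus, receives a TF–IDF weight of exactly zero, matching the observation above that common words carry low IDF.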

4.5. Evaluation

A quantitative measure is required for evaluating the performance of the approach one uses. It is also required for comparison with the existing techniques and for tracking improvement in one's own approach. While choosing the evaluation measure it is important to consider all the available choices and select the quantitative measures accordingly.


Fig. 4. Detailed Methodology.

5. Experiments and results

As our methodology stated, we had three tasks to perform, so we designed our experiments according to those tasks. Each experiment has two states:
State 1: When the document is open or is being edited
State 2: When the document is closed
Since there are two states, we need our memory dump in both states. We have used 30 Word documents in total. These documents were chosen randomly from the internet and cover diverse topics, because we are still in the experimental phase and wanted to classify documents in a better way. We divided these documents into five sections, or datasets. The size of a dataset may appear small, but in reality every line of a document is treated as a separate document itself. So if document one contains 10 lines, we have to classify 10 documents within one document; the size of the dataset thus increases exponentially with the number of documents. To save our own resources we started off with a limited number of documents. The details are shown below in Table 2.
Dataset S1 contains only one document. Dataset S2 contains 4 new files and 1 file from dataset S1, and dataset S3 contains 5 newer files and 5 old files from datasets S1 and S2. Similarly, the contents of dataset S5 contain all of the above files too.
At first we performed all of these experiments while the documents were being viewed; then we performed the same experiments when the documents were not being viewed. The reason is that we wanted to find how much data is retained in the RAM when a particular file is not in use.


Table 2
Details of datasets used.
DataSet Number of files Size
𝑆1 1 12 KBs
𝑆2 5 69.8 KBs
𝑆3 10 131 KBs
𝑆4 15 651 KBs
𝑆5 30 976 KBs

Table 3
No. of extracted XML components.
DataSet Number of XML components
𝑆1 4
𝑆2 20
𝑆3 40
𝑆4 60
𝑆5 120

Before we started our experimentation we wanted to check whether the complete Word document gets loaded into RAM or only some portions or bits are loaded. For this we reconstructed the document byte by byte from RAM. We reorganized all the fragments that were extracted from the RAM dump and were able to open the complete Word document. Hence we were able to confirm our assumption that the complete document gets loaded into the RAM.
Following are the sets of experiments we designed for carving the Word document from the RAM.

5.1. Experiment 1

For the first experiment we used dataset S1. We viewed just one file and took a memory dump of the RAM. The first things we needed were the components of the Word file, from which we would later extract the textual content. So we start by finding the header. This is again done using WinHex. We analyze the raw bytes and search for the header bytes, which are 504B030414000600. When we find these bytes in the RAM we label them as the header, and it becomes our starting point for finding the components. The following screenshot shows the header in the RAM dump.

Once the header is found we start finding the other components. When all the components are found we make a pool of all of these components. The pseudo code for this is shown in Algorithm 1.
The result of this experiment is shown in Table 3. Dataset 1 produced a total of 4 XML components. When we performed the same experiment for the other datasets we got more XML components, which proves that more documents lead to more XML components in RAM.
Then we analyzed the pool of components and segregated them. The components that contained the same type of data were grouped together. From these components we used document.xml to extract the textual content, because all of the textual data resides in this component. Document.xml is decoded and a number of text files are generated. The text is finally checked for probable errors. Since in experiment one we are using just one file, classification is not needed in this case. Algorithm 2 shows the steps for the extraction of textual content for all datasets.

After extracting the XML components we perform the ZIP decoding and generate a number of text files. Table 4 shows the number of textual files generated from each dataset. The number of text files generated from each document depends upon the number of lines and paragraphs in a file and cannot be generalized for every other document. Spaces are considered as a next paragraph, so if a document has spaces within lines then a text file is made for them.

5.2. Experiment 2

For the second experiment we use dataset S2, which contains 5 Word files downloaded randomly from the internet. As this experiment has five files, we have to classify the components, i.e. to


Table 4
No. of textual components.
DataSet Size Number of textual components
𝑆1 12 KB 8
𝑆2 76 KB 65
𝑆3 172 KB 158
𝑆4 696 KB 247
𝑆5 976 KB 534

find which component belongs to which file. For the classification we have used the two most commonly available clustering algorithms. But before applying the algorithms we perform preprocessing on the data. Algorithm 3 depicts the steps for preprocessing.

After the preprocessing phase is over, our text is ready to be used by the machine learning algorithm. The preprocessing has a direct effect on our results, because by preprocessing we reduce the amount of data that needs processing and also remove stop words like "is", "are" and "the" that have no significant meaning but would appear in parallel in many documents. Hence preprocessing is considered an important step.
The final step is to perform the clustering of the text. Initially we applied both K-means and hierarchical clustering to our data, because we did not know which algorithm would perform better. Algorithm 4 shows the clustering of objects using both K-means and hierarchical clustering and the evaluation metrics used for testing these techniques:

A simple case is to count the number of labels assigned by the system that are correct. In text classification the commonly used performance evaluation measures are a function of the following:

• True positives (TP): the system predicts +1 for text that is actually present in the class.
• True negatives (TN): the system predicts −1 for text that is not present in the class.
• False positives (FP): the system predicts +1 for text that is not actually present in the class.
• False negatives (FN): the system predicts −1 for text that is actually present in the class.

In order to find how much relevant data is retrieved by the system, we use the measures precision and recall. These measures are defined below:

5.3. Experiment 3, 4 and 5

Similarly, experiments 3, 4 and 5 were performed using the datasets S3, S4 and S5 respectively, containing a total of 30 files. There is no practical real-life scenario in which we would have to open a total of thirty files at one time; this experiment was designed to check the performance of our classifier when the amount of data is drastically increased. So we opened all 30 files at a single time, took the RAM dump, and from this RAM dump we recreated our original documents.
The results from all of the above five experiments are compiled together so that we can get an idea of the performance of our carver. For evaluating our technique we used the precision and recall evaluation measures.
Precision tells us how accurately our technique is working, and recall tells us how much content is retrieved and extracted correctly. Together, precision and recall give us the performance metrics of our technique.
In the classification of text, each text instance belongs to one of the available classes. So if each class is assigned a label, the two class cases would be +1 and −1: +1 for positive and −1 for negative.

5.3.1. Precision

Precision is used to measure how exact a classifier is. A classifier with few false positives is a good classifier, whereas a classifier with many false positives is a bad classifier. The basic question answered by precision is: what proportion of positive identifications is correct?
Precision = Total number of relevant documents retrieved / Total number of documents that are retrieved.

Precision = TP / (TP + FP)    (1)

Example: if we have one true positive and one false positive, the precision would be half, i.e. 0.5:

Precision = TP / (TP + FP) = 1 / (1 + 1) = 0.5    (2)

5.3.2. Recall

Recall measures how sensitive or complete a classifier is. A high value of recall means fewer false negatives, while a low value of recall means more false negatives. Recall is used to answer: how many actual positives were correctly identified?
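As a quick sanity check, the two measures can be sketched in a few lines; the function names are ours, and the example values follow the worked equations in this section:

```python
# Quick check of the precision / recall arithmetic used in the evaluation.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

print(precision(1, 1))         # 0.5, as in Eq. (2)
print(round(recall(1, 8), 2))  # 0.11, as in Eq. (4)
```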


Table 5
Results obtained from hierarchical clustering.
Dataset TP FP
S1 7 1
S2 54 11
S3 145 13
S4 221 26
S5 438 96

Table 6
Results obtained from K-means clustering.
Dataset TP FP
S1 6 2
S2 49 16
S3 104 54
S4 154 93
S5 321 213

Table 7
Results for precision.
Dataset Hierarchical clustering K-means clustering
S1 0.875 0.750
S2 0.830 0.753
S3 0.917 0.658
S4 0.894 0.623
S5 0.820 0.601

Table 8
Results obtained from hierarchical clustering.
Dataset TP FN
S1 7 1
S2 53 12
S3 138 20
S4 223 24
S5 461 73

Table 9
Results obtained from K-means clustering.
Dataset TP FN
S1 7 1
S2 50 15
S3 116 42
S4 161 86
S5 332 202

Recall = Total number of relevant documents retrieved / Total number of relevant documents in the database.

Recall = TP / (TP + FN)    (3)

Example: if we consider a scenario in which TP is one but the value of FN is 8, then our recall would be 0.11:

Recall = TP / (TP + FN) = 1 / (1 + 8) = 0.11    (4)

In order to fully evaluate our model we need to consider both precision and recall. However, precision and recall are inversely related, meaning that increasing precision reduces recall and vice versa.
For calculating precision and recall we first need to find TP, FP and FN based on our clustering algorithm results. Tables 5 and 6 give TP and FP for the calculation of precision for hierarchical clustering and K-means clustering respectively. (See Tables 7 and 8.)
So after the extraction of textual components and the preprocessing of the data, we perform both K-means clustering and hierarchical clustering, and the results for precision and recall are shown in Tables 9 and 10 respectively.
From the above results of TP and FN we calculate the following results for recall:

Table 10
Recall results.
Dataset Hierarchical clustering K-means clustering
S1 0.978 0.802
S2 0.815 0.769
S3 0.873 0.734
S4 0.902 0.651
S5 0.863 0.621

Table 11
Results obtained from Multinomial Naive Bayes.
Dataset Total components TP FP
S1 8 4 4
S2 65 43 22
S3 158 137 21
S4 247 213 34
S5 524 321 203

Table 12
Results obtained from neural networks.
Dataset Total components TP FP
S1 8 3 5
S2 65 38 27
S3 158 122 36
S4 247 178 69
S5 524 356 168

The above results, as well as the graphical results, show that hierarchical clustering outperforms K-means clustering. The precision range for K-means is 0.601 to 0.750 and its recall range is 0.621 to 0.802, whereas the precision of hierarchical clustering ranges from 0.820 to 0.917 and its recall from 0.815 to 0.978. This shows that when we have to perform text clustering it is better to use hierarchical clustering than the K-means clustering technique.
If we plot the values of these datasets for precision and recall, we get the graphs shown in Figs. 5 and 6.
After performing the two basic clustering techniques, we tried Multinomial Naive Bayes, because it can give good results with a small amount of data, such as a couple of thousand samples. It basically computes the conditional probability of the occurrence of two events based on each individual event's occurrence. When we applied it to our data we did not get any satisfactory results. One reason might be that our dataset was too small; another might be that the type of data in each file was very different, so the conditional probabilities were not accurate enough. (See Table 11.)
After this we used a word-embedding based neural network classifier, Word2Vec, which is a well-known neural network model. It works by capturing similarity in the context of the words used. We trained the model on our data to find the distances between all of our words. However, the technique did not work well for smaller documents, because Word2Vec could not capture the right context of the words, and a pre-trained Word2Vec model is not suitable for this particular domain of forensics. Hence the results were again not satisfying. (See Table 12.)
Then we repeated our experiment using a Support Vector Machine (SVM). SVM provides accurate results with very little training, but it requires higher computational resources. SVM works by dividing the space into two subspaces using a hyperplane or a line; one subspace contains the items that belong to a group and the other the items that do not. The good thing about SVM is that it can classify and cluster complex data easily and accurately. The following table gives the comparison of the results of each technique. The results show that SVM performs better than MNB and NNs; hence we used SVM in our paper to improve the results. (See Tables 13 and 14.)
When we saw the TP and FP results of the above mentioned techniques, we clearly got the idea that SVM was performing better,


Table 13
Results obtained from SVM clustering.
Dataset Total components 𝑇𝑃 𝐹𝑃
𝑆1 8 7 1
𝑆2 65 57 8
𝑆3 158 149 9
𝑆4 247 231 16
𝑆5 524 455 79

Table 14
Results obtained from SVM clustering.
Dataset 𝑇𝑃 𝐹𝑁
𝑆1 7 1
𝑆2 55 10
𝑆3 142 16
𝑆4 233 14
𝑆5 345 189

Fig. 5. Graphical representation of Precision.


Table 15
Results for precision.
Dataset Hierarchical clustering K-mean clustering SVM clustering
𝑆1 0.875 0.750 0.875
𝑆2 0.830 0.753 0.876
𝑆3 0.917 0.658 0.9430
𝑆4 0.894 0.623 0.9352
𝑆5 0.820 0.601 0.8520

Table 16
Recall results.
Dataset Hierarchical clustering K-mean clustering SVM clustering
𝑆1 0.978 0.802 0.875
𝑆2 0.815 0.769 0.8461
𝑆3 0.873 0.734 0.8987
𝑆4 0.902 0.651 0.9433
𝑆5 0.863 0.621 0.6460

Fig. 6. Graphical representation of Recall.


hence we further proceeded with SVM classification for our research. Then we found the FNs and calculated precision and recall. (See Tables 15 and 16.)
The results show that using SVM improves the overall performance of our technique and gives better overall results compared with both K-means and hierarchical clustering. Also, by using SVM we can generalize our technique to more complex word classification tasks.
The above comparison shows that both precision and recall have increased for SVM. This indicates that our technique will also work efficiently when the dataset increases or we have a Word document of bigger size.

6. Analysis and discussion

This section provides the analysis of the results acquired in the above experiments. As mentioned before, the experiments are performed in two states, i.e. when a file is viewed or edited vs. when it is not being viewed, so the results are also divided into two scenarios. Scenario 1 is when the file is being viewed or edited and Scenario 2 is when the file is not being viewed or edited.
We know that each document has about 11 to 12 XML components. Instead of extracting all these components, we extract only the first four components from each document, because all the textual content is present in the first four components. This reduces the processing time many fold: rather than extracting and decoding the whole document, we can simply extract parts of the document and extract the content.
The above experimentation proves that hierarchical clustering is the better technique when performing text clustering. For our final analysis we performed the same experiment designed above with another clustering technique, known as Mean Shift.
Unlike in K-means, in Mean Shift we again do not need to specify the number of clusters. In Mean Shift we start by treating every feature set as a cluster center. Every data point has a bandwidth, i.e. a radius, associated with it. We start with one point and define a bandwidth; then we see how many more points fall within the same bandwidth. We take the mean of all these points and make a new center. We again define the bandwidth, mark the points that fall within it, and take their mean. We continue doing this until all the points converge and the cluster is found. We repeat this process for every point in our dataset until we have converged all the points and reached the optimum level. We now perform hierarchical clustering and Mean Shift for both scenarios, i.e. when the system is idle and when the system is in use, and the same experiment is repeated three times at totally different time intervals. The purpose was to see how the performance of the carver changes over time. The experiments and their results are given in Table 17.

6.1. Experiment performed right before the system is idle

Table 17 shows that the percentage of correctly carved documents is 90.54 percent if the system is left idle, meaning that no other action is performed on the system. This means that almost the whole document is present in the RAM. If we observe closely we see that datasets 3 and 4 show some anomaly; this is because the false positives for the files in these datasets are higher, which in turn results


Table 17
Experiment performed right before the system is closed, for hierarchical clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 7.5 7 90.625
S2 65 61 58 91.538
S3 158 143 140 89.55
S4 247 222 217 88.866
S5 534 498 486 92.134
Avg – – – 90.54

Table 18
Experiment performed right before the system is idle, using Mean Shift clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 7.5 7 90.625
S2 65 62 60 94.0538
S3 158 147 141 91.120
S4 247 210 198 82.580
S5 534 478 465 88.289
Avg – – – 89.333

Table 19
Experiment performed 10 min after the system is closed, using hierarchical clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 6 5 68.75
S2 65 53 49 78.46
S3 158 137 129 84.177
S4 247 217 203 85.020
S5 534 483 465 88.7
Avg – – – 81.021

Table 20
Experiment performed 10 min after the system is closed, using Mean Shift clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 6 4 62.5
S2 65 44 36 61.492
S3 158 145 132 87.622
S4 247 196 178 75.6
S5 534 465 434 84.136
Avg – – – 74.27

Table 21
Experiment performed 30 min after the system was closed, using hierarchical clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 4 3 43.75
S2 65 39 30 53.07
S3 158 98 79 56.012
S4 247 152 138 58.706
S5 534 344 299 60.20
Avg – – – 54.3476

Table 22
Experiment performed 30 min after the system was closed, using Mean Shift clustering.
Dataset Number of objects Scenario 1 Scenario 2 Perc. carved
S1 8 5 3 63.75
S2 65 43 34 59.20
S3 158 115 86 63.56
S4 247 122 108 46.512
S5 534 312 205 48.394
Avg – – – 56.2832

in slightly fewer documents being classified, which reduces the carving percentage.
One probable cause is that the dictionary contains many common words, which results in classifying a text as part of more than one unique document. This percentage difference is very small and can be neglected in this case, but if we want more refined results then the dictionary used should include very unique words.
What this means is that the false positives were very few and the true positives helped in correctly identifying most parts of the Word document. Hence we can deduce that if very little activity is performed on the system, the chances of retrieving the whole document increase immensely.
When the same experiment is performed using Mean Shift clustering, we see that dataset 2 deviates from the pattern with a percentage of 94.05, which is very high; this happened because of how Mean Shift performs its clustering.
In this case very distinct means were found and the documents were classified almost accurately. Since we were using a small dataset this can happen in an ideal scenario, but in real life, and even in this same experiment, when the number of documents increases the percentage of correctly carved components decreases. (See Table 18.)

6.2. Experiment performed 10 minutes after activity

When the same experiment is performed after ten minutes of activity, the percentage of correctly carved documents for hierarchical clustering is 81.021 percent (Table 19), and for Mean Shift clustering the percentage is 74.27. The carving percentage has reduced, as shown in Table 20.
The reason for this reduction is that after the ten minutes of activity the contents of the RAM have changed. Since our analysis depends solely on the RAM and its contents, any change in the RAM directly impacts the performance of our technique. Some portion of the document must have been overwritten, and if the overwritten part has text similar to that of our dataset, the false positives rise considerably and the performance decreases.

6.3. Experiment performed 30 minutes after activity

After some activity has been performed for 30 minutes, the percentage of carved documents becomes 54.34 percent, as shown in Table 21. This means that the percentage has further reduced.
Above are the results obtained for the two scenarios, performed on a total of thirty documents. The carving percentages vary depending upon the time intervals: the carving percentage reduces to almost 50 percent when the RAM is used for thirty minutes for different types of activity. We can deduce that the performance of our technique depends upon the amount of data available in the RAM, and that hierarchical clustering is the better approach for performing clustering; when we have a large number of clusters, hierarchical clustering outperforms the Mean Shift approach. (See Table 22.)
Our technique has shown promising results, but these results depend on the fact that the dataset chosen consists of n random documents; if documents with similar content are chosen, then the results would definitely vary. Also, we chose a very small dataset for our experimentation because we had computational limits; larger datasets would generate complex clustering problems and would drastically change the results.
Table 23 shows a brief comparison between the already existing techniques and our technique. Our technique is more adaptable and requires less computation compared with the other techniques. We have also catered for the fragmentation issues that arise because of the structure of the RAM. For cases where headers are missing, this technique performs the same, because the content is extracted from the document.xml file and not from the header.

7. Conclusion and future work

In this paper the proposed methodology for extraction of the OOXML file format from the RAM is explained, for the case when we have no information regarding the file system or metadata available to us. We have


Table 23
A comparison between the proposed and some already existing techniques.

Features                                                Proposed technique   Existing techniques
Header of fixed magnitude available                                          ✓
If header is lost then file recovery is possible        ✓
Processing is less                                      ✓
Does it accommodate fragments of files                  ✓
Machine Learning clustering techniques are being used   ✓
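The header independence shown in the comparison above follows from OOXML packaging: a .docx file is an ordinary ZIP archive, and the document body lives in the word/document.xml part. A minimal stdlib sketch of pulling that text follows; the regex over <w:t> runs is our own illustrative simplification, not the paper's carver, and the input may be a path or any file-like object.

```python
# Minimal sketch: extract visible text from an OOXML document's word/document.xml.
import re
import zipfile

def docx_text(source) -> str:
    """Return the visible text of a .docx; source is a path or file-like object."""
    with zipfile.ZipFile(source) as z:  # OOXML is a ZIP container
        xml = z.read("word/document.xml").decode("utf-8")
    # Visible text sits inside <w:t> runs; the surrounding markup carries none.
    runs = re.findall(r"<w:t[^>]*>(.*?)</w:t>", xml, flags=re.S)
    return "".join(runs)
```

Because the recoverable content sits entirely inside this part, losing the file header does not prevent text extraction, which is what the carving technique exploits.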

first extracted the OOXML components from the RAM and then extracted the textual content of the documents from those components. Since this textual content belonged to multiple files, we performed clustering using machine learning techniques in order to differentiate which text belonged to which file. We used the K-means and Hierarchical clustering techniques, and our results showed that Hierarchical clustering is the better technique when it comes to clustering text files. Our results also showed that the performance of our technique depends essentially on the amount of data available to us in the RAM: the more data available, the better the carving we can perform.

One aspect that still needs to be explored is the extraction of images in the OOXML file format. Furthermore, once the texts are correctly identified, the sequence of the paragraphs, i.e. which paragraph comes first and which comes next, needs to be recovered. Finally, we performed our research on the RAM; the research can be extended to cloud setups, where the applicability of our technique should be validated. Hence OOXML file format carving remains a promising field for researchers.

CRediT authorship contribution statement

Noor Ul Ain Ali: Conceptualizing, Methodology, Validation, Investigation, Writing – original draft. Waseem Iqbal: Conceptualizing, Investigation, Writing – review & editing, Supervision. Hammad Afzal: Writing – review & editing, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
