Radioelectronic and Computer Systems, 2023, no. 3(107) ISSN 2663-2012 (online)
Keywords: digital forensics; metadata; fragmentation; fragmented file; data recovery; file carving; file fragment
identification; file reconstruction; file restoring; artificial intelligence.
1. Introduction

1.1. Motivation of research

Users constantly create, view, edit, and delete many files when working with data. This is a dynamic process. The file system is responsible for the mechanisms and rules for storing data on the disk space [1]. Researchers regularly search for deleted information and recover it when conducting digital forensic examinations, because users who engage in illegal or compromising activities typically try to cover their tracks by deleting sensitive information.

Leaving aside the internal processes of SSDs [2], file systems usually optimize their work so that they do not take any action with deleted data blocks [1]. Such disk space areas are only marked as free for use and remain intact until they are allocated for storing other information. As a result, unallocated disk space can contain forensically important data.

Some file types (for example, TXT, LOG, DOC) store their data in an uncompressed form. Their full or partial contents can be accessed without restoring the entire object by reading their detected data blocks or identifying text fragments using search terms. However, this is not sufficient when trying to extract the contents of compound files that use compression or encryption, or that have a complex internal structure. These file types include JPG, BMP, AVI, MPG, DOCX, XLSX, PDF, and SQLITE.

A separate digital forensics sphere is the study of RAM, particularly volatile memory dumps in the
Windows operating system [3]. RAM areas may contain the contents of files the user has been working with that may not have been stored on the disk [4]. Such files occupy non-contiguous data blocks, the location of which may not always be known.

The file recovery process is more difficult when the files are fragmented and there is no file allocation data. In these circumstances, searching for file fragments and positioning them correctly is a time-consuming and complex task with no clear-cut solution. In addition, it is necessary to consider the increasing number of digital devices and the amount of information available in general. In recent years, advanced file carving techniques have been used increasingly to solve and optimize various stages of such tasks.

1.2. Research gap

In recent years, researchers have periodically reviewed file carving techniques. The most common methods of data recovery are presented in [5].

Some authors have focused on a survey of various data carving techniques for multimedia [6, 7] or JPEG [8] files. In [9], the researchers focused only on the efficiency analysis of the Scalpel and Foremost carving processes. The paper [10] discusses the recovery of a more extensive set of file types, with a focus on fragmented Microsoft Word documents.

In other cases, the file carving algorithms were divided according to a particular principle. For example, in [11], the authors classified carving methods for JPEG files into basic and advanced categories and conducted a detailed analysis of graph theoretic and weightage techniques. Similar approaches to the classification of file carving methods are used in [12], where the author also presents a taxonomy of file carving techniques.

In addition to the techniques and carving directions discussed, the work [13] includes data recovery research area mapping.

Despite a relatively large number of surveys on data recovery techniques, their authors did not comprehensively consider the problem of file carving. The works are not sufficiently systematized. In addition, the ontological relationship between file carving and various aspects of this process has not been established.

1.3. Objectives and Contributions

This study systematizes and builds schemes for recovering highly fragmented files using advanced techniques and determines the feasibility of using artificial intelligence in this process.

The key issues are as follows:
- analyze the existing advanced file carving techniques;
- identify the stages of data recovery where these techniques are applied;
- determine the feasibility of using artificial intelligence and advanced techniques.

Structurally, this work consists of the following sections. The research methodology is described in section 2. Section 3 discusses the main phases of digital forensics, data recovery with and without file system metadata, and the ontological diagram of file carving. Advanced file carving techniques and their details are provided in section 4. Section 5 presents a discussion of the aforementioned techniques. The last section provides conclusions and indicates directions for future research.

2. Research methodology

The research hypothesis is that the carving of highly fragmented files depends on three key factors:
- the efficiency of identifying data fragments in unallocated space and/or RAM;
- the techniques used to reconstruct the detected file fragments;
- the file type and its internal content.

For this purpose, the three research questions identified for the current literature review are shown in Table 1.

Table 1
Research questions
# | The question
Q1 | What are the typical stages of file carving, and what are the prospects for improving each stage?
Q2 | Is it possible to carve fragmented files without a priori information about their internal structure and contents?
Q3 | What are the prospects for using artificial intelligence methods in the file carving field?

The search approach is based on selecting and analyzing articles that address the problems of carving highly fragmented files or solve individual phases of this process.

The selection process consisted of several stages. Initially, the most relevant studies were identified by searching for keywords in the titles and abstracts. The articles were then reviewed for their relevance to the research questions. The final set of articles was selected based on the quality of the content.

For a complete understanding of the problems that appear when recovering fragmented data in the absence of file system metadata, the literature covering the period of rapid growth in digital forensics, approximately the last 20 years, was analyzed.
206 ISSN 1814-4225 (print)
3. Background, Directions and Ontology

3.1. Digital forensics

When conducting digital forensics examinations, researchers perform several actions depending on the type of research, the type and number of objects, the tasks to be solved, etc. [14]. In general, this process can be conditionally divided into four stages: collection, examination, analysis, and presentation (Fig. 1).

Fig. 1. Digital forensics stages

During the first stage, digital media are collected and copies of them are created. In the next phase, the created images are processed. As a rule, a full-fledged study of disk space is conducted: file system analysis, hidden information detection, deleted file recovery, signature analysis, indexing, pattern search, etc. In the last two stages, investigators identify important data, interpret them, and generate a report with detailed answers to the questions.

- the file system metadata is not affected;
- metadata of the deleted file is lost.

If the metadata is available, the file can be recovered using information about the location of its data blocks [1]. The only complication may be that certain areas of the deleted file have been overwritten with other data. Then, at best, only a partial reconstruction of the file is possible, with the loss of some or all of its contents, depending on the file type and on the number and character of the lost fragments.

Figure 3 illustrates a possible case of data overwriting. At the top is the initial state of the disk space with existing files #1, #2, and #3. At the bottom is the current state of the same locations of the disk space, where file #1 is wholly overwritten and file #3 is partially overwritten after user manipulations. In this case, if the file system metadata is available, files #2 and #4 will be fully recovered, file #1 will be lost, and file #3 will be restored but partially overwritten. At the same time, the recovery of even partial contents of file #3 is highly questionable.
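The overwriting scenario described for Figure 3 can be illustrated with a minimal sketch. It assumes that the block lists of the deleted files are still available from file system metadata and that the set of reused blocks is known. The names `DeletedFile` and `classify_recovery`, as well as all block numbers, are hypothetical and are not taken from the reviewed works.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class DeletedFile:
    name: str
    blocks: List[int]  # block numbers recorded in the surviving metadata (hypothetical values)


def classify_recovery(files: List[DeletedFile], overwritten: Set[int]) -> Dict[str, str]:
    """Label each deleted file by how many of its blocks were reused by other data."""
    result = {}
    for f in files:
        lost = [b for b in f.blocks if b in overwritten]
        if not lost:
            result[f.name] = "fully recoverable"
        elif len(lost) == len(f.blocks):
            result[f.name] = "lost"
        else:
            result[f.name] = "partially overwritten"
    return result


# Illustrative layout: file #1 is wholly overwritten, file #3 only in its last block.
files = [
    DeletedFile("file1", [0, 1]),
    DeletedFile("file2", [2, 3, 4]),
    DeletedFile("file3", [5, 6, 7]),
]
overwritten = {0, 1, 7}
status = classify_recovery(files, overwritten)
print(status)
```

Whether a file labeled "partially overwritten" yields any usable content still depends on its type, as noted above; the sketch only tracks block reuse.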
Usually, it is not a problem to identify the first fragment of a file, which in most cases has a clear marker in the form of a header in its initial bytes. However, not all files have a footer, and a footer can have any offset relative to the beginning of the block/cluster. If the file consists of three or more fragments, the first key problem is to determine the data blocks that do not have clear markers, such as the header and footer. Subsequently, it is necessary to cluster the detected fragments and directly reconstruct the file or its contents.

Fig. 5 shows the ontological diagram, which indicates the principles of file carving, its properties, the tools required for it, the phases of file carving, the factors that affect the result, the techniques used, and the criteria used for evaluation.

The set of software tools shown in Fig. 5 contains only basic information, is not complete, and depends on the platform on which the data recovery process is performed. Usually, at the pre-processing stage of file carving, utilities such as FTK Imager, DD, X-Ways Forensics, and EnCase Imager are used to create a full bit-for-bit copy of the original media. Then, at the examination stage, the disk space is analyzed. For this purpose, universal tools such as X-Ways Forensics, UFS Explorer, EnCase, Magnet Axiom, Autopsy, Forensic Explorer, and FTK are most often used. They operate on the principle of a Swiss Army knife. Scalpel, Foremost, PhotoRec, and RecoverIt are utilities explicitly designed for data recovery, which is performed using proprietary algorithms. The abovementioned software does not guarantee 100% results and works well only with non-fragmented data. For this reason, in non-trivial cases, specialists often use additional tools for manual data recovery, such as hex editors and highly specific scripts [17].

To evaluate the effectiveness of the software, it is advisable to determine the number of correctly recovered (true positive or TP), incorrectly recovered (false positive or FP), and unrecovered (false negative or FN) files [18]. Subsequently, the following criteria are applied:

- precision – the percentage of correctly recovered files among the results of the utility's work [18]:

$$\mathrm{Precision} = \frac{TP}{TP + FP}; \tag{1}$$

- recall – the percentage of correctly recovered files from their total number in the digital media [18]:

$$\mathrm{Recall} = \frac{TP}{TP + FN}; \tag{2}$$

- f-measure – the overall performance of a tool [18 - 20]:

$$F_{measure} = \frac{1}{\alpha / P + (1 - \alpha) / R}, \tag{3}$$

where P is the precision, R is the recall, and α is a numeric value from 0 to 1 used to determine the precision and recall weights;
- reliability – the tool's efficiency among supported file types [18 - 20]:

$$\mathrm{Reliability} = \frac{SF - SFN}{SF}, \tag{4}$$

where SF is the number of supported files in the dataset and SFN is the number of supported false negatives;

- computational complexity – the amount of resources required to solve the task. Computational complexity is often estimated by the data processing speed or task execution time with the same computing resources [19, 20].

When comparing the effectiveness of the utilities, some researchers [19, 20] also divided false positive files into two categories: partially recovered (known false positive or kFP) and remaining files (unknown false positive or uFP). As a result, precision and recall are defined as follows [19, 20]:

$$\mathrm{Precision} = \frac{TP}{TP + uFP + kFP / \beta}, \tag{5}$$

where β is a numeric value not less than 1 used to determine the relative weight of uFP compared with kFP;

$$\mathrm{Recall} = \frac{all - FN}{all}, \tag{6}$$

where all is the total number of files in the dataset.

Each of the above metrics (Precision, Recall, F-measure, Reliability) can take a value from 0 to 1 and shows the quality of the tool. Metric values close to 1 indicate that the software performs well. Table 2 shows the interpretation of the low values of the Precision, Recall, and Reliability metrics [18 - 20]. It is worth noting that authors often compare the number of files successfully recovered using their methods with the results of recognized utilities such as Scalpel, Foremost, PhotoRec, etc.

Table 2
Interpretation of the low values of the metrics
Metric | Interpretation
Precision | A large number of false positives
Recall | A small number of correctly recovered files
Reliability | A large number of failures when recovering supported file types

conducted using the keywords and their combinations shown in Table 3.

Table 3
Search terms
# | Keywords
1 | file carving
2 | data carving
3 | smart carving
4 | machine learning
5 | artificial intelligence
6 | data recovery
7 | fragmented files

The most relevant detected works, their direction, brief description, and particularities are shown in Table 4. In general, these studies show that advanced file carving techniques are successfully used to varying degrees at the identification, clustering, reconstruction, and restoration stages in addition to standard digital forensics methods. However, most studies do not clearly distinguish between these phases. For example, clustering and validation often occur during file reconstruction and/or restoring. Usually, these issues are solved in parallel. In addition, most of the authors who addressed the issue of reconstruction or restoring performed data validation and verification. Therefore, the last two stages are not mentioned separately in Table 4.

In this case, identification means identifying data blocks related to a specific type of data or files. Clustering involves dividing the identified data fragments into groups of blocks belonging to different files. During reconstruction, the identified data blocks are placed in the correct order. During the restoring process, by contrast, the file's contents are restored in case of damage or loss of some file areas.

Fig. 6 shows pre-processing and typical stages of file carving.
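The evaluation criteria of equations (1) to (6) can be sketched as plain functions. This is only an illustration under stated assumptions: the function names and the sample counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here rather than one prescribed in [18 - 20].

```python
def precision(tp, fp):
    """Equation (1): share of correctly recovered files among all carved results."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (2): share of correctly recovered files among all files on the media."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r, alpha=0.5):
    """Equation (3): weighted harmonic combination of precision p and recall r."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


def reliability(sf, sfn):
    """Equation (4): sf supported files, sfn supported false negatives."""
    return (sf - sfn) / sf if sf else 0.0


def weighted_precision(tp, ufp, kfp, beta=1.0):
    """Equation (5): known false positives are discounted by beta >= 1."""
    denom = tp + ufp + kfp / beta
    return tp / denom if denom else 0.0


def dataset_recall(total, fn):
    """Equation (6): total is the number of files in the dataset."""
    return (total - fn) / total if total else 0.0


# Hypothetical counts for one carving run.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=20)
print(p, r, f_measure(p, r), weighted_precision(80, 10, 10, beta=2.0))
```

With α = 0.5, equation (3) reduces to the usual harmonic mean of precision and recall.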
Table 4
Advanced file carving techniques
Authors | Direction | Summary
Zanero [21] | Identification | Applying a set of support vector machines classifiers to determine data blocks for the files of the following types: BMP, DOC, EXE, GIF, JPG, MP3, ODT, PDF, PPT (9 classes). Average true positive rate – 90.4%, average false positive rate – 12.4%.
Fitzgerald et al. [22] | Identification | File fragment classification using a supervised learning approach based on support vector machines combined with the bag-of-words model (24 classes). The best results were obtained for CSV, PS, GIF, SQL, HTML, JAVA, XML, and BMP files (>90%). Fragments of PPTX, PPS, DOCX, XLSX, PPT, SWF, JPG, ZIP, GZ, PDF, and TXT files – 2.3% to 31.8% of prediction accuracy.
Beebe et al. [23] | Identification | Using support vector machines (N-gram vectors) to classify data blocks across 30 file types and 8 data types. Overall classification rate – 73.4%. High misclassification rate of encrypt, PPT, ZIP, PPTX, GZIP, PNG, FLV, DOC, XLSX, PDF, DOCX, AVI, and BMP files.
Pan et al. [24] | Identification | A method to identify the AVI-type blocks based on their internal structure. False positive rate – 53% (2 classes).
Wang et al. [25] | Identification | File fragment classification (18 classes) using N-grams frequencies. The average prediction accuracy is up to approximately 61%. Problems with classifying XLSX, PPTX, DOCX, GZ, PNG, PDF, PPT, and SWF files.
Karampidis et al. [26] | Identification | Comparison of machine learning methods (Decision Trees, Support Vector Machines, Neural Networks, Logistic Regression, k-Nearest Neighbor) for data block identification. Prediction accuracy – 89% to 100%. Only 4 different classes (JPG, PDF, PNG, GIF).
Al-Sadi et al. [27] | Reconstruction | Reconstructing graphic files by determining the image to which a fragment belongs. NaiveBayesMultinomialUpdateable, MultiClass, RandomForest, and BayesNet classifiers are used to determine the similarity between pixel values. The best results are 91% to 99.2% on average. Only graphic files.
Bhatt et al. [28] | Identification | File fragment classification using a hierarchical machine-learning-based approach with optimized support vector machines (SVM). 14 classes – CSV, DOC, HTML, PDF, PPT, XML, XLS, TXT, GIF, JPG, PNG, PS, SWF, and GZ. An average accuracy of 67.78%. PPT, PDF, DOC fragments – the worst results.
Sportiello et al. [29] | Identification | Construct SVM classifiers to determine the type of data block. 8 classes – BMP, DOC, EXE, GIF, JPG, MP3, ODT, and PDF files.
Mittal et al. [30] | Identification | 512-byte and 4096-byte fragment type classification using convolutional neural networks with automatic feature extraction. 65.6% and 77.5% accuracy in the case of 75 classes. HEIC, MOV, 7Z, DMG, ZIP, EXE, PPTX, DJVU, PDF, DOCX – quite low rates.
Sester et al. [31] | Identification | File type identification approaches using support vector machines and neural networks for n-gram analysis. 6 classes – CSV, DOC, JPG, PPT, TXT, and XLS. Approximately 73% to 98% accuracy in different cases.
Chen et al. [32] | Identification | 4096-byte fragment type classification using a deep convolution neural network. 16 classes – CSV, DOC, DOCX, GIF, GZ, HTML, JAVA, JPG, LOG, PDF, PNG, PPT, RTF, TEXT, XLS, and XML. 70.9% accuracy. Low results – DOC, DOCX, GIF, JPG, PNG, and TEXT. Represent all bytes of the data block as a grayscale image (automatic feature extraction).
Hiester [33] | Identification | Using recurrent (RNN), convolutional (CNN), and feed-forward neural networks (FNN) as classifiers of 512-byte data blocks. 4 classes: CSV, XML, JPG, and GIF. Up to 98% accuracy in the best case (automatic feature extraction).
Ghaleb et al. [34] | Identification | 512-byte and 4096-byte fragment type classification using light-weight convolutional neural networks. 66.33% and 79.27% accuracy in the case of 75 classes.
Continuation of Table 4
Authors | Direction | Summary
Liu et al. [35] | Identification | A 512-byte fragment type classification technique that converts the byte stream into a 2-D grayscale image and then captures both sequences by convolutional neural networks. 71.4% accuracy in the case of 75 classes.
Bharadwaj [36] | Identification | Using grayscale image conversion and convolutional neural networks to detect the compression algorithm of a 4096-byte data block. 8 classes – rar, gzip, zip, 7-zip, bzip2, ncompress, lz4, and brotli. The achieved accuracy is 41% after five epochs.
Hague et al. [37] | Identification | Using the feature generation model, Byte2Vec, for feature extraction from 4096-byte fragments and k Nearest Neighbors for classification. 35 to 42 classes. An accuracy rate of 74%.
Vulinovic et al. [38] | Identification | File type identification using feed-forward and convolutional neural networks. 18 classes – CSV, DOC, DOCX, GIF, GZ, HTML, JPG, PDF, PNG, PPT, PPTX, PS, RTF, SWF, TXT, XLS, XLSX, and XML. Macro-average F1-score: FFNN – 79.93% to 81.38%, CNN – 61.55%.
Heo et al. [39] | Identification, Restoring | Identification and restoration of damaged audio files using feed-forward and Long Short Term Memory (LSTM) neural network. High rates of identification of audio files.
Na et al. [40] | Identification, Restoring | Restoring fragmented and partially overwritten video files by video frame analyses. 40 to 50% of the video with damaged data (50% overwriting) was recovered. Only MPEG-4 and H.264 video formats.
Amrouche et al. [41] | Restoring | Recover damaged images with a lost header. 90% accuracy of image properties identification; 78% accuracy for header prediction.
Alghafli et al. [42] | Identification, Restoring | Identification and recovery of video with lost video codecs specifications. Problems with fragmented files.
Qiu et al. [43] | Identification, Reconstruction | Using the byte frequency distribution and rate of change as features for building a classifier based on SVM. Reassembling fragments of the same file type using the PUP approach. The target file type is JPEG. Other file types are PNG, XML, HTML, PDF, GZ, ZIP, Office, MP3, and TXT. Better results (40.9% to 85.7%) compared with PhotoRec.
Guo et al. [44] | Identification, Reconstruction | Using SVM for high-entropy file fragment classification and Parallel Unique Path algorithm for multimedia file reconstruction. Only 3 types (DOC, JPEG, C++ source code) were studied.
Ali et al. [45] | Identification, Reconstruction | JPEG carving framework using an extreme learning machine and evolutionary algorithms for data block identification, validation, and reassembling. 90 to 93% accuracy. Problems with more than 2 fragmentation patterns or intertwined images.
Ali et al. [4] | Identification, Clustering, Reconstruction | Analysis of the textual contents of DOCX files in RAM and application of K-means and Hierarchical clustering techniques to recover documents' texts. 54.35% to 90.54% of recovered documents. Possible problems with fragmented data blocks.
Al-Sharif et al. [46] | Identification, Clustering, Restoring | Finding PDF fragments in RAM using their internal structure. K-Means and Hierarchical clustering to define different documents. 46.34% to 50.24% of the PDF contents were carved (without file reconstruction).
Zhang et al. [47] | Identification, Reconstruction | Finding and reassembling SQLite databases using knowledge of their internal structure. Time-consuming method.
Hilgert et al. [48] | Identification, Reconstruction | Finding and reassembling PNG files using knowledge of their internal structure. Better results compared with PhotoRec, Scalpel, and Foremost. Problems with recovering files with missing fragments in the middle and/or the peculiarities of dividing the file into data blocks.
Tang et al. [49] | Reconstruction | Carving of highly fragmented JPEG files. The proposed framework can recover 97% of fragmented JPEG files. Fragmentation points are detected using the coherence of Euclidean Distance.
Information security and functional safety 211
Continuation of Table 4
Authors | Direction | Summary
Ravi et al. [50] | Reconstruction | Carving fragmented text and some graphic files. Only several graphic file types (JPG, PNG, GIF). TXT files – dictionary-based approach.
Roussev et al. [51] | Identification | Presenting several file fragmentation techniques. The need to manually examine files and find specific features.
Lin et al. [52] | Reconstruction | DOC files' carving method based on internal structure. Better results (95.45%) than PhotoRec, Foremost.
Birmingham et al. [53] | Reconstruction | Carving fragmented JPEG files using knowledge about their internal structure. Better results compared with Adroit, FTK 3.3, Scalpel, PhotoRec, ProDiscover, and Encase 6. Does not cover out-of-order fragmentation.
Durmus et al. [54] | Reconstruction, Restoring | Reassembling orphaned JPEG fragments using PRNU fingerprints of the cameras. It can also partially collect photos. 42% to 57% fragment localization accuracy.
Chang et al. [55] | Reconstruction | JPEG fragment carving using pixel similarity. Success rate – 92%.
Uzun et al. [56] | Restoring | An Advanced Carver for JPEG Files. Ability to recover JPEG files with damaged or lost headers.
Boiko et al. [57] | Reconstruction | Reconstructing highly fragmented OOXML files. Up to 83% recovered files. Problems with embedding in documents.
Hand et al. [58] | Reconstruction | Utility for recovering binary executable files using their internal structure.
Xu et al. [59] | Identification, Reconstruction | Identification and reassembly of EVTX Log fragments using their internal structure.
Garfinkel [16] | Reconstruction | Fast object validation for bi-fragmented files (JPEG, DOC, and ZIP files).
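Several identification approaches in Table 4 use byte-level statistics of a data block (byte frequency distribution, n-grams, entropy) as classifier features. The following minimal sketch shows how such features could be computed for one block. It is not the feature set of any particular reviewed work; the function names are hypothetical, and the JPEG check relies only on the standard start-of-image marker bytes FF D8 FF.

```python
import math
from collections import Counter


def byte_histogram(block: bytes) -> list:
    """Relative frequency of each of the 256 byte values in the block."""
    n = len(block)
    if n == 0:
        return [0.0] * 256
    counts = Counter(block)
    return [counts.get(value, 0) / n for value in range(256)]


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for constant data, at most 8.0)."""
    n = len(block)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def has_jpeg_header(block: bytes) -> bool:
    """True if the block starts with the JPEG start-of-image marker."""
    return block.startswith(b"\xff\xd8\xff")


# Hypothetical 4096-byte blocks: constant data and evenly distributed byte values.
zeros = bytes(4096)
uniform = bytes(range(256)) * 16
print(shannon_entropy(zeros), shannon_entropy(uniform))
```

High-entropy blocks (compressed or encrypted data) look alike under such statistics, which is consistent with the low accuracy reported in Table 4 for formats such as ZIP, GZ, DOCX, and PDF.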
graphic files, researchers have successfully proposed determining the similarity between pixel values [27, 55], comparing pixel values on the fragment boundaries [50], applying similarity metrics [45, 49], using PNG and JPEG internal structure features [48, 56], analyzing PRNU fingerprints of the cameras [54], and utilizing both the internal structure and the content of JPEG files [53]. In addition, the use of the internal file structure for recovery is possible with many types of compound files, such as video [40], SQLite databases [47], DOC [52], OOXML [57], BIN [58], and EVTX [59]. When recovering text documents, by contrast, there is an additional option to use their content. Therefore, in these cases, it is possible to use dictionary-based techniques [4, 46, 50].

Noteworthy is the use of artificial intelligence techniques to restore audio [39] and graphic files [41] with damaged headers, as well as the use of a validator to reconstruct video files with lost areas containing video codec specifications [42]. In these papers, the authors proposed methods that provide access to the internal contents of damaged files. As seen from the above works [39, 41], artificial intelligence methods are a promising direction in restoring media data content. In general, this can be seen as a way to replace computationally complex algorithms.

The analyzed works show that no universal tool can simultaneously solve all problems in the search, identification, and reconstruction of file fragments. As can be seen from Table 4, two tendencies can be traced. In some cases (for instance, [23, 32, 33, 39]), researchers focus on creating new approaches or improving existing methods for specific stages of file carving. This mainly refers to the data identification phase. Because of the use of artificial intelligence at this stage, many approaches typically focus on identifying many file or data types, up to 75 [30, 34, 35]. In other words, there is a certain universality in most cases.

Another tendency is to use the peculiarities of the internal structure of certain file types or their contents in file carving (for example, [4, 43, 46, 48]). The methods proposed in these papers are developed for identifying, clustering, reconstructing, or restoring only files of specific types. Almost every one of these approaches (e.g., [47, 50, 57]) requires first studying the internal structure of a file type or gaining access to certain parts of its contents. Therefore, they are usually not appropriate for other file types.

Conclusions

This paper systematizes advanced file carving techniques and presents an ontological scheme of file carving. Although file carving techniques are generally known and understandable, they have several disadvantages when working with different types of fragmented files. As a result, many researchers have attempted to improve existing techniques and develop their own data recovery methods. The mentioned ontological scheme can be used by digital forensics investigators as a roadmap for these purposes.

At the beginning of the study, we identified three questions. The conclusions obtained from the analysis of the papers are summarized below.

Q1. What are the typical stages of file carving, and what are the prospects for improving each stage?

In general, in the case of data fragmentation, there is a tendency to divide the file carving process into stages to solve individual subtasks: 1) identification of data blocks without explicit markers and 2) classification and reconstruction of files or their contents.

The first of these stages, the identification of data blocks, is characterized by the widespread use of artificial intelligence techniques. Artificial intelligence models and methods are quite efficient. However, most researchers focus on identifying a limited range of data types. Therefore, a promising direction is the development of models and methods that can identify a wide range of data block types and be self-learning. In addition, the analyzed techniques need to be improved to increase accuracy and prevent the loss of important data blocks in case of misclassification.

The main problem of the following phases is the difficulty of clustering the detected data blocks, i.e., assigning a particular group of fragments to a specific file. Out-of-order fragmentation raises additional issues with the correct assembly of the file. It can be concluded that there are no universal techniques at these stages, and all of them require a detailed analysis of the file types to be recovered.

Q2. Is it possible to carve fragmented files without a priori information about their internal structure and contents?

The universal methods used to identify data blocks actually depend on the alphabet size (the number of classes) of the classification models. At the same time, the reconstruction process of files depends on their internal structure and/or contents. Therefore, each described method is applied only to recover files of certain types. The only exception in some cases may be approaches for recovering bi-fragmented files.

Q3. What are the prospects for using artificial intelligence methods in the field of file carving?

The role of artificial intelligence is not restricted to identifying data fragments. It is also important to restore access to file contents in cases of overwriting or damage to some areas of files. Thus, artificial intelligence techniques are used to generate headers to restore the content of damaged media files. In general, artificial intelligence models and methods are a promising approach to reducing complexity. Due to the universality of artificial
intelligence, it is possible to use artificial intelligence techniques to develop carving methods independent of the internal structure and content of files.

Limitations. This paper does not provide an overview of all available data recovery methods. Emphasis was placed on methods of recovering fragmented files with lost or damaged metadata. In addition, the goal was not to study methods of minimizing the cost of resources and time, such as building a map of unused data [61].

Future research should focus on increasing the accuracy and efficiency of the proposed methods and the resource and time economy. Improving artificial intelligence techniques for identifying blocks of data types will allow the detection of a more complete set of fragments of target file types and minimize erroneously omitted data. With regard to data reconstruction, due to the large variety of file types, the current issues are to improve existing methods and develop new approaches.

Contribution of authors: conceptualization of the problem, supervision and revision – Viacheslav Moskalenko; original draft preparation – Maksym Boiko; visualization, review, and editing – Oksana Shovkoplias.

All authors have read and agreed with the published version of this manuscript.

References

1. Carrier, B. File System Forensic Analysis. Addison-Wesley Professional, 2005. 600 p.
2. Bonetti, G., Viglione, M., Frossi, A., Maggi, F., & Zanero, S. Black-box forensic and antiforensic characteristics of solid-state drives. Journal of Computer Virology and Hacking Techniques, 2014, vol. 10, no. 4, pp. 255–271. DOI: 10.1007/s11416-014-0221-z.
3. Ligh, M. H., Case, A., Levy, J., & Walters, A. The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory. 1st Edition. John Wiley & Sons, 2014. 912 p.
4. Ali, N. U. A., Iqbal, W., & Afzal, H. Carving of the OOXML document from volatile memory using unsupervised learning techniques. Journal of Information Security and Applications, 2022, vol. 65, ar-
7. Alrobieh, Z. S., & Raqpan, A. M. A. A. File Carving Survey on Techniques, Tools and Areas of Use. Transactions on Networks and Communications, 2020, vol. 8, no. 1, pp. 16–26. DOI: 10.14738/tnc.81.7636.
8. Al-Jawry, R., Mohamad, K., Jamel, S., & Ahmad Khalid, S. K. A review of digital forensics methods for JPEG file carving. Journal of Theoretical and Applied Information Technology, 2018, vol. 96, no. 17, pp. 5841–5856. Available at: http://www.jatit.org/volumes/Vol96No17/17Vol96No17.pdf (accessed 19.09.2023).
9. Thomas, R. A., & Mathai, M. A Survey on File Carving Process Using Foremost and Scalpel. National Conference on Emerging Computer Applications (NCECA2021), Kerala, 2021, vol. 3, no. 1, pp. 70–72. DOI: 10.5281/ZENODO.5091663.
10. Ali, N. U. A., Iqbal, W., & Shafqat, N. Analysis of Windows OS's Fragmented File Carving Techniques: A Systematic Literature Review. 16th International Conference on Information Technology-New Generations (ITNG 2019). Springer International Publishing, 2019, pp. 63–67. DOI: 10.1007/978-3-030-14070-0_10.
11. Sari, S. A., & Mohamad, K. M. A Review of Graph Theoretic and Weightage Techniques in File Carving. Journal of Physics: Conference Series. IOP Publishing, 2020, vol. 1529, no. 5. DOI: 10.1088/1742-6596/1529/5/052011.
12. Ramli, N. I. S., Hisham, S. I., & Razak, M. F. A. Survey of File Carving Techniques. Innovative Systems for Intelligent Health Informatics (IRICT 2020). Lecture Notes on Data Engineering and Communications Technologies, Springer, 2021, vol. 72, pp. 815–825. DOI: 10.1007/978-3-030-70713-2_74.
13. Alherbawi, N., Shukur, Z., & Sulaiman, R. A Survey on Data Carving in Digital Forensic. Asian Journal of Information Technology, 2016, vol. 15, no. 24, pp. 5137–5144. Available at: http://docsdrive.com/pdfs/medwelljournals/ajit/2016/5137-5144.pdf (accessed 19.09.2023).
14. Kävrestad, J. Analyzing Data and Writing Reports. Fundamentals of Digital Forensics. Springer International Publishing, 2020, pp. 85–98. DOI: 10.1007/978-3-030-38954-3_10.
15. Lin, X. File Carving. Introductory Computer Forensics. Springer International Publishing, 2018, pp. 211–233. DOI: 10.1007/978-3-030-00581-8_9.
ticle no. 103096. DOI: 10.1016/j.jisa.2021.103096. 16. Garfinkel, S. L. Carving contiguous and
5. Darnowski, F., & Chojnaki, A. Selected fragmented files with fast object validation. Digital
methods of file carving and analysis of digital storage Investigation, 2007, vol. 4, pp. 2–12. DOI:
media in computer forensics. Teleinformatics Review, 10.1016/j.diin.2007.06.017.
2015, vol. 1-2, pp. 25–40. Available at: 17. Dubettier, A., Gernot, T., Giguet, E., &
https://yadda.icm.edu.pl/ baztech/element/bwmeta1.ele- Rosenberger, C. File type identification tools for digital
ment.baztech-10af3f4e-db53-4ae5-9b7f- investigations. Forensic Science International: Digital
b7e850dd08d0/c/Darnowski_F_Chojnacki_A.pdf (ac- Investigation, 2023, vol. 46, article no. 301574. DOI:
cessed 19.09.2023). 10.1016/j.fsidi.2023.301574.
6. Pahade, R. K., Singh, B., & Singh, U. A Survey 18. Alghafli, K., Jones, A., & Martin, T.
on Multimedia File Carving. International Journal of Investigating and measuring capabilities of the forensics
Computer Science & Engineering Survey (IJCSES), file carving techniques. Future Information Technology.
2015, vol. 6, no. 6, pp. 27–46. DOI: 10.5121/ijcses. Lecture Notes in Electrical Engineering, Springer, 2014,
2015.6603.
vol. 276, pp. 329–336. DOI: 10.1007/978-3-642-40861-8_47.
19. Kloet, S. J. J. Measuring and Improving the Quality of File Carving Methods. MSc thesis, Eindhoven University of Technology, Department of Mathematics and Computer Science, The Netherlands, 2007. 111 p. Available at: https://research.tue.nl/files/46916835/635640-1.pdf (accessed 25.06.2023).
20. Laurenson, T. Performance analysis of file carving tools. IFIP Advances in Information and Communication Technology. Security and Privacy Protection in Information Processing Systems, 2013, vol. 405, pp. 419–433. DOI: 10.1007/978-3-642-39218-4_31.
21. Zanero, S. File block classification by Support Vector Machines. 2011 Sixth International Conference on Availability, Reliability and Security, Vienna, Austria, 2011, pp. 307–312. DOI: 10.1109/ARES.2011.52.
22. Fitzgerald, S., Mathews, G., Morris, C., & Zhulyn, O. Using NLP techniques for file fragment classification. Digital Investigation, 2012, vol. 9, pp. S44–S49. DOI: 10.1016/j.diin.2012.05.008.
23. Beebe, N. L., Maddox, L. A., Liu, L., & Sun, M. Sceadan: Using concatenated N-gram vectors for improved file and data type classification. IEEE Transactions on Information Forensics and Security, 2013, vol. 8, no. 9, pp. 1519–1530. DOI: 10.1109/TIFS.2013.2274728.
24. Pan, J., Liu, L., Sun, G., & Tang, Y. A method to identify the AVI-type blocks based on their four-character codes and C4.5 algorithm. 2014 International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC2014), Shanghai, China, 2014, pp. 1–7. DOI: 10.1109/BESC.2014.7059521.
25. Wang, F., Quach, T.-T., Wheeler, J., Aimone, J. B., & James, C. D. Sparse Coding for N-Gram Feature Extraction and Training for File Fragment Classification. IEEE Transactions on Information Forensics and Security, 2018, vol. 13, no. 10, pp. 2553–2562. DOI: 10.1109/TIFS.2018.2823697.
26. Karampidis, K., Kavallieratou, E., & Papadourakis, G. Comparison of Classification Algorithms for File Type Detection: A Digital Forensics Perspective. POLIBITS, 2017, vol. 56, pp. 15–20. Available at: https://api.semanticscholar.org/CorpusID:51882719 (accessed 25.06.2023).
27. Al-Sadi, A., Yahya, M. B., & Almulhem, A. Identification of image fragments for file carving. World Congress on Internet Security (WorldCIS-2013), London, UK, 2013, pp. 151–155. DOI: 10.1109/WorldCIS.2013.6751037.
28. Bhatt, M., Mishra, A., Kabir, M. W. U., Blake-Gatto, S. E., Rajendra, R., Hoque, M. T., & Ahmed, I. Hierarchy-Based File Fragment Classification. Machine Learning and Knowledge Extraction, 2020, vol. 2, no. 3, pp. 216–232. DOI: 10.3390/make2030012.
29. Sportiello, L., & Zanero, S. Context-based file block classification. IFIP Advances in Information and Communication Technology, 2012, vol. 383, pp. 67–82. DOI: 10.1007/978-3-642-33962-2_5.
30. Mittal, G., Korus, P., & Memon, N. FiFTy: Large-Scale File Fragment Type Identification Using Convolutional Neural Networks. IEEE Transactions on Information Forensics and Security, 2021, vol. 16, pp. 28–41. DOI: 10.1109/TIFS.2020.3004266.
31. Sester, J., Hayes, D., Scanlon, M., & Le-Khac, N. A. A comparative study of support vector machine and neural networks for file type identification using n-gram analysis. Forensic Science International: Digital Investigation, 2021, vol. 36, article no. 301121. DOI: 10.1016/j.fsidi.2021.301121.
32. Chen, Q., Liao, Q., Jiang, Z. L., Fang, J., Yiu, S., Xi, G., Li, R., Yi, Z., Wang, X., Hui, L. C. K., Liu, D., & Zhang, E. File fragment classification using grayscale image conversion and deep learning in digital forensics. 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 2018, pp. 140–147. DOI: 10.1109/SPW.2018.00029.
33. Hiester, L. File Fragment Classification Using Neural Networks with Lossless Representations. Bachelor Thesis, East Tennessee State University. Undergraduate Honors Theses, 2018, Paper 454, 36 p. Available at: https://dc.etsu.edu/honors/454 (accessed 25.06.2023).
34. Ghaleb, M., Saaim, K., Felemban, M., Al-Saleh, S. M., & Al-Mulhem, A. File Fragment Classification using Light-Weight Convolutional Neural Networks. arXiv (Cornell University), 2023. DOI: 10.48550/arxiv.2305.00656.
35. Liu, W., Wang, Y., Wu, K., Yap, K., & Chau, L. A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings. arXiv (Cornell University), 2023. DOI: 10.48550/arxiv.2304.06983.
36. Bharadwaj, S. Using convolutional neural networks to detect compression algorithms. arXiv (Cornell University), 2021. DOI: 10.48550/arxiv.2111.09034.
37. Haque, E., & Tozal, M. E. Byte embeddings for file fragment classification. Future Generation Computer Systems, 2022, vol. 127, pp. 448–461. DOI: 10.1016/j.future.2021.09.019.
38. Vulinovic, K., Ivkovic, L., Petrovic, J., Skracic, K., & Pale, P. Neural Networks for File Fragment Classification. 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 2019, pp. 1194–1198. DOI: 10.23919/mipro.2019.8756878.
39. Heo, H.-S., So, B.-M., Yang, I.-H., Yoon, S.-H., & Yu, H.-J. Automated recovery of damaged audio files using deep neural networks. Digital Investigation, 2019, vol. 30, pp. 117–126. DOI: 10.1016/j.diin.2019.07.007.
40. Na, G. H., Shim, K. S., Moon, K. W., Kong, S. G., Kim, E. S., & Lee, J. Frame-based recovery of corrupted video files using video codec specifications. IEEE Transactions on Image Processing, 2014, vol. 23, no. 2, pp. 517–526. DOI: 10.1109/TIP.2013.2285625.
41. Amrouche, S. C., & Salamani, D. Non-parametric adaptative JPEG fragments carving. Tenth
International Conference on Machine Vision, Vienna, Austria, 2017, article no. 106962D. DOI: 10.1117/12.2310079.
42. Alghafli, K., & Martin, T. Identification and recovery of video fragments for forensics file carving. 2016 11th International Conference for Internet Technology and Secured Transactions (ICITST), Barcelona, Spain, 2016, pp. 267–272. DOI: 10.1109/ICITST.2016.7856710.
43. Qiu, W., Zhu, R., Guo, J., Tang, X., Liu, B., & Huang, Z. A new approach to multimedia files carving. 2014 IEEE International Conference on Bioinformatics and Bioengineering, Boca Raton, FL, USA, 2014, pp. 105–110. DOI: 10.1109/BIBE.2014.31.
44. Guo, J., He, J., & Huang, N. Research of Multiple-type Files Carving Method Based on Entropy. Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering, 2016, pp. 521–528. DOI: 10.2991/nceece-15.2016.98.
45. Ali, R. R., & Mohamad, K. M. RX_myKarve carving framework for reassembling complex fragmentations of JPEG images. Journal of King Saud University - Computer and Information Sciences, 2021, vol. 33, no. 1, pp. 21–32. DOI: 10.1016/J.JKSUCI.2018.12.007.
46. Al-Sharif, Z. A., Al-Khalee, A. Y., Al-Saleh, M. I., & Al-Ayyoub, M. Carving and clustering files in RAM for memory forensics. Far East Journal of Electronics and Communications, 2018, vol. 18, no. 5, pp. 695–722. DOI: 10.17654/ec018050695.
47. Zhang, L., Hao, S., & Zhang, Q. Recovering SQLite data from fragmented flash pages. Annals of Telecommunications, 2019, vol. 74, no. 7–8, pp. 451–460. DOI: 10.1007/s12243-019-00707-9.
48. Hilgert, J. N., Lambertz, M., Rybalka, M., & Schell, R. Syntactical Carving of PNGs and Automated Generation of Reproducible Datasets. Digital Investigation, 2019, vol. 29, pp. S22–S30. DOI: 10.1016/j.diin.2019.04.014.
49. Tang, Y., Fang, J., Chow, K. P., Yiu, S. M., Xu, J., Feng, B., Li, Q., & Han, Q. Recovery of heavily fragmented JPEG files. Digital Investigation, 2016, vol. 18, pp. S108–S117. DOI: 10.1016/j.diin.2016.04.016.
50. Ravi, A., Kumar, T. R., & Mathew, A. R. A method for carving fragmented document and image files. 2016 International Conference on Advances in Human Machine Interaction (HMI), Kodigehalli, India, 2016, pp. 1–6. DOI: 10.1109/HMI.2016.7449170.
51. Roussev, V., & Garfinkel, S. L. File fragment classification - The case for specialized approaches. 2009 Fourth International IEEE Workshop on Systematic Approaches to Digital Forensic Engineering, Berkeley, CA, USA, 2009, pp. 3–14. DOI: 10.1109/SADFE.2009.21.
52. Lin, W., & Xu, M. A Microsoft Word documents carving method base on interior virtual streams. Advanced Materials Research, 2012, vols. 433–440, pp. 3028–3032. DOI: 10.4028/www.scientific.net/AMR.433-440.3028.
53. Birmingham, B., Farrugia, R. A., & Vella, M. Using thumbnail affinity for fragmentation point detection of JPEG files. IEEE EUROCON 2017 - 17th International Conference on Smart Technologies, Ohrid, Macedonia, 2017, pp. 3–8. DOI: 10.1109/EUROCON.2017.8011068.
54. Durmus, E., Korus, P., & Memon, N. Every Shred Helps: Assembling Evidence from Orphaned JPEG Fragments. IEEE Transactions on Information Forensics and Security, 2019, vol. 14, no. 9, pp. 2372–2386. DOI: 10.1109/TIFS.2019.2897912.
55. Chang, X., Wu, J., & Hao, F. JPEG fragment carving based on pixel similarity of MED-ED. 2019 Chinese Control Conference (CCC), Guangzhou, China, 2019, pp. 8862–8866. DOI: 10.23919/ChiCC.2019.8865161.
56. Uzun, E., & Sencar, H. T. JpgScraper: An Advanced Carver for JPEG Files. IEEE Transactions on Information Forensics and Security, 2020, vol. 15, pp. 1846–1857. DOI: 10.1109/TIFS.2019.2953382.
57. Boiko, M., & Moskalenko, V. Syntactical method for reconstructing highly fragmented OOXML files. Radioelectronic and Computer Systems, 2023, no. 1, pp. 166–182. DOI: 10.32620/reks.2023.1.14.
58. Hand, S., Lin, Z., Gu, G., & Thuraisingham, B. Bin-Carver: Automatic recovery of binary executable files. Digital Investigation, 2012, vol. 9, pp. S108–S117. DOI: 10.1016/j.diin.2012.05.014.
59. Xu, M., Sun, J., Zheng, N., Qiao, T., Wu, Y., Shi, K., & Yang, T. A Novel File Carving Algorithm for EVTX Logs. Digital Forensics and Cyber Crime. ICDF2C 2017, Prague, Czech Republic, 2017, vol. 216, pp. 97–105. DOI: 10.1007/978-3-319-73697-6_7.
60. Memon, N., & Pal, A. Automated reassembly of file fragmented images using greedy algorithms. IEEE Transactions on Image Processing, 2006, vol. 15, no. 2, pp. 385–393. DOI: 10.1109/tip.2005.863054.
61. Karresand, M., Warnqvist, A., Lindahl, D., Axelsson, S., & Dyrkolbotn, G. O. Creating a Map of User Data in NTFS to Improve File Carving. Advances in Digital Forensics XV. 15th IFIP WG 11.9 International Conference, Orlando, FL, USA, 2019, pp. 133–158. DOI: 10.1007/978-3-030-28752-8_8.
Maksym Boiko – PhD Student at the Computer Science Department of Sumy State University, Sumy, Ukraine; Senior Detective, Information Processing and Analysis Department, the National Anti-Corruption Bureau of Ukraine, Kyiv, Ukraine,
e-mail: [email protected], ORCID: 0000-0003-0950-8399, Scopus Author ID: 58199360000.
Viacheslav Moskalenko – PhD, Associate Professor at the Computer Science Department of Sumy State University, Sumy, Ukraine; Doctoral Student at the Department of Computer Systems, Networks and Cybersecurity, National Aerospace University “KhAI”, Kharkiv, Ukraine,
e-mail: [email protected], ORCID: 0000-0001-6275-9803, Scopus Author ID: 57189099775.
Oksana Shovkoplias – PhD, Senior Lecturer at the Computer Science Department of Sumy State University, Sumy, Ukraine,
e-mail: [email protected], ORCID: 0000-0002-4596-2524, Scopus Author ID: 55647364100.