C.1 File Type Identification: A Literature Review
All content following this page was uploaded by Konstantinos Karampidis on 02 February 2023.
Abstract: The rapid growth and use of digital devices (e.g. computers, Android tablets and
smartphones) have made people vulnerable to cybercrimes. Dr. Debarati Halder and Dr. K.
Jaishankar (2011) define cybercrimes as: "Offences that are committed against individuals or
groups of individuals with a criminal motive to intentionally harm the reputation of the victim
or cause physical or mental harm, or loss, to the victim directly or indirectly, using modern
telecommunication networks such as Internet (Chat rooms, emails, notice boards and groups)
and mobile phones (SMS/MMS)" [1]. For instance, one major and loathsome crime is child
pornography. A child predator may try to hide evidence in a computer or any other digital
device, by changing the file type. This could be easily done by altering the file extension or
the file signature. A digital forensic examiner, on the other hand, uses forensic software to
accurately identify file types in order to determine which files may contain potential
evidence. Nevertheless, current type recognition mechanisms are vulnerable to simple
"attacks", and even the most widely used commercial forensic software suites may not
correctly identify an intentionally altered file. For example, if someone changes a file's extension
from .jpg to .doc, the forensic software will detect that the file type has been changed.
However, if the file signature is also changed to match that of a .doc file, the
detection algorithm of the forensic software may perform poorly. Another important field where
file type identification must be quick and accurate is spam e-mail. Every day a massive number
of spam e-mails is received and a lot of time is spent deleting them. Unfortunately, this is not
the only disadvantage. Network bandwidth is consumed, e-mail servers slow down and
an inexperienced end user may not be able to tell whether an e-mail hides malicious
content. These are only a few examples of the possible damage caused by unsuccessful
file type recognition. This literature review examines the known practices for
identifying a file type and records the current recognition techniques.
Keywords: file type identification, digital forensics examiner, cybercrime, computer security,
network security
1. INTRODUCTION
A file format is the blueprint of a file. It tells the processing device (e.g. a computer) how the
data within the file are organized and specifies how the information is encoded in a digital
storage medium. A file format may be proprietary (e.g. .dwg for an AutoCAD file), free
(not burdened by any copyrights, patents or other restrictions), or open (anyone
can read and study it, but its use may be restricted).
One popular method used by many operating systems, including Windows, which is the
most popular operating system among computer end users, is to determine the format of a file
from the end of its name, i.e. the letters following the final period. This is known as
the filename extension. For example, Microsoft Word documents are identified by names that end
with .doc (or .docx), and PNG images by .png. In the original FAT filesystem, file names
were limited to an eight-character identifier and a three-character extension, known as an 8.3
filename (also called a short filename or SFN).
Many formats still use three-character extensions even though modern operating systems
and applications no longer have this constraint. Some file formats are designed for one
particular type of data: .doc and .docx denote document files, .jpg denotes a compressed
picture, and .png denotes an image stored with lossless data compression.
Other file formats are intended to store several different types of data:
the Flash Video format (.flv, .f4v) from Adobe Systems can act as a container for both video
and audio.
There are thousands of file formats and the list grows day by day. Since there is
no standard list of extensions and more than one format can use the same
extension, both the operating system and end users can be confused. From a user's
perspective, this confusion might stem from mere ignorance or could hide deceit. This literature
review surveys the methods of file type identification proposed by the scientific
community.
Identifying a file format is a tiresome task. It requires specific, expert
knowledge of computers or other digital devices in order to identify the correct file type
with high accuracy and speed. There are lists containing several thousand known file types,
but no global standard for file types exists. File type detection methods can be
categorized into three kinds: extension-based, magic bytes-based, and content-based methods.
Each of them has its own advantages and weaknesses, and none of them is comprehensive or
infallible enough to satisfy all the requirements.
2.1 EXTENSION-BASED
The fastest and easiest method of file type detection is the extension-based method. Filename
extensions can be considered a type of metadata. They indicate
how data might be stored in the file. This approach is very common in Microsoft’s
operating systems, where it is used almost exclusively. All file types, at least on Windows-based
systems, are generally accompanied by an extension. In the FAT filesystem, file names were
limited to an eight-character identifier and a three-character extension. Even though newer
filesystems such as NTFS support longer extensions, the three-letter approach is still
supported for reasons of backward compatibility. This approach can be
applied to both binary and text files.
The main advantage of this method is its speed. The extension-based
method does not have to open the file in order to determine the file type. Nevertheless, it has a
major vulnerability: it can be easily fooled by simply renaming the file. In Linux/Unix
systems an extension is not required to identify the file type. Linux allows optional extensions
of any string regardless of file type. To recognize a file type in Linux, the ‘file’
command is run in a terminal, and the file type is determined from the file’s magic bytes.
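The extension-based approach described above can be illustrated with a minimal Python sketch. The lookup table and the function name are hypothetical and cover only a handful of types; real operating systems consult much larger registries. The point of the sketch is that the file is never opened, which makes the check fast but trivially fooled by renaming.

```python
import os

# Hypothetical extension-to-type table, for illustration only.
EXTENSION_TYPES = {
    ".doc": "Microsoft Word document",
    ".docx": "Microsoft Word document",
    ".jpg": "JPEG image",
    ".png": "PNG image",
    ".pdf": "PDF document",
}


def type_from_extension(filename):
    """Guess the file type from the letters after the final period only."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TYPES.get(ext.lower(), "unknown")


# The content is never inspected, so renaming alone changes the answer.
print(type_from_extension("holiday.jpg"))
print(type_from_extension("holiday.doc"))
```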
2.2 MAGIC BYTES-BASED
The second method of file type detection is based on magic bytes. These are
predefined signatures found in a file’s header. A file header is the first portion
of a computer file and contains metadata. The metadata may include information about the
content, quality and condition of the file. The file header also contains the information
the corresponding application needs to recognize and understand the file. Magic bytes may also
include extra information about the tool, and the version of the tool, used to
produce the file.
Figure 1: The file signature of a .doc file. The magic bytes are in the rectangular box
Checking the magic bytes of a file is much slower than just checking its
extension, since the file must be opened (usually in a hex editor) and its magic bytes
read and compared with the predefined ones. As mentioned earlier, the magic bytes
method is adopted by many UNIX-based operating systems, where it is invoked by typing the ‘file’
command in a terminal. However, like the extension-based method, this method of identifying a
file type also has weaknesses: magic bytes are not used in all file types. They only work on
binary files and are not an enforced or regulated aspect of file types. They vary in length
across file types and do not always give a very specific answer.
Table 1: A list of some widely used file types and their file signatures
File Type                               Signature (hexadecimal)
DOC                                     D0 CF 11 E0 A1 B1 1A E1
PDF                                     25 50 44 46
JFIF, JPE, JPEG, JPG                    FF D8 FF E0 xx xx 4A 46 49 46 00
MP3 audio file                          49 44 33
PNG                                     89 50 4E 47 0D 0A 1A 0A
RAR (v5) compressed archive file        52 61 72 21 1A 07 01 00
MS Windows/DOS Executable File (EXE)    4D 5A
There are several thousand file types [2] for which magic bytes are defined, and there are
multiple lists of magic bytes that are not completely consistent with one another. Since there
is no standard for what a file can contain, the creators of a new file type usually include
something to uniquely identify their file type. It is also common for some programs or their
developers not to put any magic bytes at the beginning of the file header. This approach can also be
deceived. Altering the magic bytes of a file is a much harder way to defeat file type
detection than renaming the extension, but the result is the same, i.e. the file type is not
accurately recognized.
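A signature check of this kind can be sketched in a few lines of Python. The signatures below are copied from Table 1; the dictionary, the function names and the 16-byte read length are assumptions made for illustration, not part of any forensic tool. Note that the sketch trusts the header completely, so a file whose first bytes have been overwritten with another signature is reported as that other type.

```python
# Signatures from Table 1; names and structure are illustrative assumptions.
SIGNATURES = {
    "DOC": bytes.fromhex("D0CF11E0A1B11AE1"),
    "PDF": bytes.fromhex("25504446"),
    "MP3": bytes.fromhex("494433"),
    "PNG": bytes.fromhex("89504E470D0A1A0A"),
    "RAR (v5)": bytes.fromhex("526172211A070100"),
    "EXE": bytes.fromhex("4D5A"),
}


def type_from_magic(header):
    """Return the type whose signature matches the header, or None."""
    # JPEG/JFIF: FF D8 FF E0, two variable bytes (xx xx), then 4A 46 49 46 00.
    if header[:4] == bytes.fromhex("FFD8FFE0") and header[6:11] == b"JFIF\x00":
        return "JPEG"
    for name, signature in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def identify(path):
    """Read only the first bytes of the file and match them against the table."""
    with open(path, "rb") as handle:
        return type_from_magic(handle.read(16))
```

Comparing the result of such a check with the filename extension is one simple way to flag a renamed file, although it does not help when the signature itself has been forged.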
2.3 CONTENT-BASED
The third method of file type detection is to examine the file contents and apply statistical
modeling techniques. It is a new and promising research area and is perhaps the only way to
detect forged file types. It can reveal malicious files whose contents do not
match their claimed types.
McDaniel and Heydari were the first to propose a method for content-based file
type detection [3], [4]. They proposed three different algorithms:
Byte Frequency Analysis (BFA), Byte Frequency Cross-correlation (BFC), and
File Header/Trailer (FHT) analysis. These algorithms were used to produce a "fingerprint"
of each file type. Since files of the same type have similar fingerprints,
the fingerprint of an unknown file is compared with the known ones to find its true file type.
The accuracy varied from 23% to 96% depending on which algorithm was used.
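The fingerprint idea can be sketched as follows. This is a simplified illustration in the spirit of byte frequency analysis, not the authors' exact algorithm: scaling the counts by the most frequent byte, averaging over sample files and the linear scoring rule are all simplifying assumptions, as are the function names.

```python
from collections import Counter


def normalized_frequencies(data):
    """Byte counts scaled so that the most frequent byte value gets 1.0."""
    counts = Counter(data)
    peak = max(counts.values(), default=1)
    return [counts.get(value, 0) / peak for value in range(256)]


def build_fingerprint(samples):
    """Average the normalized frequencies of several files of one known type."""
    rows = [normalized_frequencies(sample) for sample in samples]
    return [sum(column) / len(rows) for column in zip(*rows)]


def score(data, fingerprint):
    """Similarity in [0, 1]; 1.0 means identical normalized frequencies."""
    freqs = normalized_frequencies(data)
    return sum(1.0 - abs(a - b) for a, b in zip(freqs, fingerprint)) / 256


def classify(data, fingerprints):
    """Return the name of the best-scoring fingerprint."""
    return max(fingerprints, key=lambda name: score(data, fingerprints[name]))
```

Here `fingerprints` is a dictionary that maps a type name to the fingerprint built from example files of that type.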
Li et al. [5] made a few changes to McDaniel and Heydari's method in order to
improve its accuracy. They stated that it is very difficult to produce a single descriptive
model that accurately represents all members of a file type class. Instead, they proposed
computing a set of centroid models and using clustering to find a minimal set of centroids with
good performance, although more pattern data is then needed. This approach resulted in
82% accuracy (one centroid), 89.5% accuracy (multi-centroid) and 93.8% accuracy (more
exemplar files).
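The multi-centroid idea can be illustrated with a short sketch in which each file type is represented by several centroid vectors and an unknown file is assigned to the type that owns the nearest centroid. The Manhattan distance and the data layout used here are assumptions chosen for brevity; the sketch does not show how the centroids are obtained by clustering and is not the exact procedure of [5].

```python
def manhattan(u, v):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def nearest_type(vector, centroids_by_type):
    """centroids_by_type maps a type name to a list of centroid vectors."""
    best_type, best_distance = None, float("inf")
    for name, centroids in centroids_by_type.items():
        # A type is as close as its closest centroid.
        distance = min(manhattan(vector, centroid) for centroid in centroids)
        if distance < best_distance:
            best_type, best_distance = name, distance
    return best_type
```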
Dunham et al. [6] used neural networks to classify 10 different file types from a dataset of
760 files and achieved 91.3% accuracy. Karresand and Shahmehri [7] proposed a method
based on data fragments. They used the Byte Frequency Distribution (BFD), and
in particular its mean and standard deviation, to model the file types. Zhang et al. used the BFD
together with a Manhattan distance comparison to detect whether the examined file is executable
or not. Moody and Erbacher [8] used Statistical Analysis for Data type Identification (SADI),
which included the average, distribution of averages, standard deviation, distribution of
standard deviations, kurtosis and distribution of byte values. They used fragments of 200 files
of 8 known file types as a dataset, which resulted in 74.2% accuracy.
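Statistics of the kind listed above can be computed directly from the byte values of a fragment. The sketch below is only an illustration of such features; the function names, the fixed window size and the use of the non-excess kurtosis (fourth central moment divided by the squared variance) are assumptions, not details taken from [8].

```python
import math


def fragment_statistics(fragment):
    """Return (mean, standard deviation, kurtosis) of the byte values."""
    n = len(fragment)
    if n == 0:
        return 0.0, 0.0, 0.0
    mean = sum(fragment) / n
    m2 = sum((b - mean) ** 2 for b in fragment) / n  # variance
    m4 = sum((b - mean) ** 4 for b in fragment) / n  # fourth central moment
    std = math.sqrt(m2)
    # Kurtosis is undefined for constant data; report 0.0 in that case.
    kurtosis = m4 / (m2 * m2) if m2 > 0 else 0.0
    return mean, std, kurtosis


def windowed_statistics(data, window=256):
    """Compute the statistics for consecutive windows of the data."""
    return [
        fragment_statistics(data[start:start + window])
        for start in range(0, len(data), window)
    ]
```

The resulting tuples could then serve as feature vectors for any classifier.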
Calhoun and Coles [9] also used a statistical method, specifically Fisher’s linear
discriminant, on a dataset of 100 fragments of 2 different file types and achieved an accuracy
of 60.3% to 86% (depending on which sequence of bytes was examined). Amirani et al. [10] used
Principal Component Analysis (PCA) and unsupervised neural networks for automatic
feature extraction. The classifier they used was a five-layer multilayer perceptron (MLP),
which achieved an accuracy of 98.33%, the best up to that point.
Cao et al. [11] used Gram Frequency Distribution and a vector space model and reported
90.34% accuracy. Ahmed et al. [12] proposed two very interesting methods. First, they
used the cosine distance as a similarity metric when comparing file content. Second,
they decomposed the identification procedure into two steps, following a divide-and-conquer
approach: in the first step, files that were similar in terms of byte pattern frequencies were
grouped into several clusters. In the next step, any cluster that contained different file types
was fed to a neural network for improved classification. They used 2000 files of 10 file types
as a dataset and achieved an accuracy of 90.19%. Ahmed et al. [13] also proposed two techniques to
reduce the classification time. The first is a feature selection technique, used with a
K-nearest neighbor (KNN) classifier. The second is a content sampling
technique, which uses a small portion of a file to obtain its byte-frequency distribution.
Amirani et al. [14] proposed an improved version of their first approach that uses an SVM
classifier and raised the accuracy of the method to 99.16% for a whole
file. Finally, Evensen et al. [15] applied n-gram analysis with a naïve Bayes classifier to a large
dataset of 60000 files (6 file types), with very good results of up to 99.51% accuracy.
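Two of the ideas mentioned above, cosine similarity between byte-frequency vectors and content sampling, can be combined in a short sketch. The block size, the number of blocks, the evenly spaced sampling scheme and all function names are assumptions made for illustration; they are not the parameters or procedures reported in [12] or [13].

```python
import math
from collections import Counter


def sample_blocks(data, block_size=256, n_blocks=4):
    """Take n_blocks evenly spaced blocks instead of using the whole content."""
    if len(data) <= block_size * n_blocks:
        return data
    step = (len(data) - block_size) // max(n_blocks - 1, 1)
    return b"".join(
        data[i * step:i * step + block_size] for i in range(n_blocks)
    )


def byte_vector(data):
    """256-element vector of raw byte counts."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Because cosine similarity depends only on the direction of the vectors, a sampled portion of a file can be compared with a model built from whole files without rescaling the counts.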
3. CONCLUSION
File type identification is a serious and, in many cases, difficult problem. Wrong file
type identification may lead to devastating results such as reduced network bandwidth,
infection of a computer or a network with malicious code, delays on e-mail servers,
steganalysis problems or even network security issues. The most loathsome of all is the
possibility that criminals of all kinds may be able to hide critical evidence from the police
authorities and cover up their illegal activities.
A digital forensic examiner, on the other hand, uses special forensic software tools
to identify which files are forged and may be potential evidence, to restore hidden or
lost data and finally to discover the true file type.
This literature review recorded the most important work done in this research field. Three
main methods were discussed: extension-based, magic bytes-based and content-based
methods. The accuracy of the proposed techniques is high, but we believe that the
third method, the content-based one, leaves room for further research, especially if combined
with methods of computational intelligence.
4. REFERENCES
[1]. D. Halder and K. Jaishankar, Cyber crime and the Victimization of Women: Laws,
Rights, and Regulations, Hershey, PA, USA: IGI Global, 2011.
[2]. G. Kessler, "File Signatures," [Online]. Available:
http://www.garykessler.net/library/file_sigs.html.
[3]. M. McDaniel, "Automatic File Type Detection Algorithm," Masters Thesis, James
Madison University, 2001.
[4]. M. McDaniel and M. H. Heydari, "Content based file type detection algorithms," in
Proceedings of the 36th IEEE Annual Hawaii International Conference on System
Science (HICSS'03), 2003.
[5]. W. Li, K. Wang, S. J. Stolfo and B. Herzog, "Fileprints: Identifying file types by
n-gram Analysis," in Proceedings of the 6th IEEE Systems, Man and Cybernetics
Information Assurance Workshop, West Point, New York, 2005.
[6]. J. G. Dunham, M. T. Sun and J. Tseng, "Classifying File Type of Stream Ciphers in
Depth Using Neural Networks," in The 3rd ACS/IEEE International Conference on
Computer Systems and Applications, 2005.
[7]. M. Karresand and N. Shahmehri, "File Type Identification of Data Fragments by Their
Binary Structure," in Proceedings of the IEEE Workshop on Information Assurance,
2006.
[8]. R. F. Erbacher and S. J. Moody, "SADI - Statistical analysis for data type identification,"
in Third International Workshop on Systematic Approaches to Digital Forensic Engineering
(SADFE’08), 2008.
[9]. W. Calhoun and D. Coles, "Predicting the types of file fragments," Digital
Investigation: The International Journal of Digital Forensics & Incident, vol. 5, pp.
14-20, 2008.
[10]. M. C. Amirani, M. Toorani and A. Beheshti, "A new approach to content-based file
type detection," in Proceedings of the 13th IEEE Symposium on Computers and
Communications (ISCC'08), 2008.
[11]. D. Cao, J. Luo, M. Yin and H. Yang, "Feature selection based file type identification
algorithm," in 2010 IEEE International Conference on Intelligent Computing and
Intelligent Systems (ICIS), 2010.
[12]. I. Ahmed, K. Lhee, H. Shin and M. Hong, "Content-based file-type identification using
cosine similarity and a divide-and-conquer approach," IETE Technical Review, vol. 27,
no. 6, pp. 465-477, 2010.
[13]. I. Ahmed, K. Lhee, H.-J. Shin and M.-P. Hong, "Fast Content-Based File Type
Identification," in Advances in Digital Forensics VII, Orlando, FL, USA, Springer
Berlin Heidelberg, 2011, pp. 65-75.
[14]. M. C. Amirani, M. Toorani and S. Mihandoost, "Feature-based type identification of
file fragments," Security and Communication Networks, vol. 6, no. 1, pp. 115-128,
2013.
[15]. J. Evensen, S. Lindahl and M. Goodwin, "File-type Detection Using Naïve Bayes and
n-gram Analysis," in Norwegian Information Security Conference, NISK 2014,
Fredrikstad, 2014.