Computational Intelligence To Aid Text File Format Identification
Santhilata Kuppili Venkata, Alex Green
The National Archives
Abstract
One of the challenges in digital preservation is identifying the type of a file that can be
opened with a simple text editor but whose extension is unknown. The problem is harder when
the file is human-readable and yet it is unclear how it should be put to use. The Text File
Format Identification (TFFI) project was initiated at The National Archives to identify file
types from plain text file contents with the help of computational intelligence models. A
methodology that uses AI and machine learning to automate the process was successfully tested
and implemented on the test data. The prototype, developed as a proof of concept, achieved an
accuracy of up to 98.58% in detecting five file formats.
1 Motivation
As an official publisher and guardian for the UK Government and for England and Wales, The
National Archives¹ (TNA) collates iconic documents from various government departments. In
this born-digital era, TNA needs to process a huge number of files daily, so sophisticated
methods are needed to handle the various tasks involved. File format identification of plain
text files is one such task.
In this digital era, files are often generated in an integrated development environment. Each
document is supported by multiple files, including programming source code, data description
files (such as XML), configuration files, etc. The contents of the supporting files are often
human-readable, i.e. they can be opened as plain text files using a simple text editor. But if
the file extensions are missing or corrupted, it is hard to know how to put the file to use. A
sample file with a missing extension is shown in Fig. 1. The file contains some characters
from the Roman alphabet written in a column. At first glance, one would think that this must
be a simple typewriting exercise in which characters are typed in order. Someone familiar with
the Unix environment cannot rule out the possibility that the file is part of a bash script or
a set of commands. To a naive user, the contents appear to be a set of encrypted commands.
Nevertheless, the question remains: how can we make use of the file when it does not make any
sense at the moment? If we have thousands of such files, it is impossible to examine each file
manually, so we need to automate the process of file type identification.
¹ http://www.nationalarchives.gov.uk
To gauge the scale of the problem, we cloned files from the publicly available GitHub
repositories of the Government Digital Service (GDS)² and The National Archives³. As TNA
handles similar data, it makes sense to use these repositories as a testbed. In all, we cloned
1457 public repositories from these two sources. They contain over 410,000 files representing
928 file types that can be opened with a simple text editor. Fig. 2 shows a partial bar graph
of the file types that have at least 500 files each in our sample dataset.
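A survey of this kind can be approximated with a short script. The following is a minimal sketch, assuming the repositories have already been cloned under one local folder; the function names and the rule of skipping `.git` folders are illustrative choices, not the authors' tooling.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under `root` by lower-cased extension ('' when missing)."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        # Skip directories and anything stored inside a .git folder.
        if path.is_file() and ".git" not in path.parts:
            counts[path.suffix.lower()] += 1
    return counts


def frequent_types(counts, minimum=500):
    """Keep only the extensions that occur at least `minimum` times."""
    return {ext: n for ext, n in counts.most_common() if n >= minimum}
```

Calling `frequent_types(count_extensions("path/to/clones"))` would give the kind of per-type totals that a bar graph of frequent file types is built from; files without an extension are grouped under the empty string.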
immediate attention. The distribution of 13 main file categories (in blue) and the corresponding
number of files (in red) for each category available in the corpus is shown in Fig. 3. The 14th
category collects all file types that do not fit any of the other 13 categories. From the above
patterns and DROID [1] reports, we found that programming source code and text files are the
major file types that TNA receives regularly. So for the initial experimentation, we started
with a small subset (22,292 files) consisting of five file types: two programming source code
file types (Java, Python), plain text files (.txt) and two delimiter-separated file types
(.csv and .tsv).
Aside from digital archiving, file type identification is a serious problem in the areas of
digital forensics and cybersecurity. Research in digital forensics is mainly focused on the
identification of image file types and their metadata. While most of this research targets
binary file formats, very little work has been done on plain text files. Though our target file
types are different, we could adopt those approaches and methods to some extent.
There are two main approaches to working with plain text files. The first approach is to treat
the file as plain text (with no prior knowledge of the file type) and search for characteristics
specific to each possible file type. The signature of a file type is a combination of the
characteristics of that file type. This is a generic method and can be extended to any number
of file types. However, it needs a thorough knowledge of each file type to generate its
characteristic features. The second approach is based on prior knowledge about a file. For
example, if we suspect that a file belongs to a programming language, we could validate the
file type by running its compiler(s) or by searching for text patterns specific to that
programming language. We followed the first approach, as our file corpus consists of a variety
of file types. We have developed a flexible methodology (described in section 3) to reflect
this approach. Our initial prototype makes use of machine learning algorithms and can identify
five formats: .py, .java, .txt, .csv, and .tsv.
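To illustrate the second approach, the sketch below checks whether a piece of text parses as Python source by handing it to Python's built-in `compile()`. The function name `looks_like_python` is hypothetical, and this check is not part of the prototype described here.

```python
def looks_like_python(text):
    """Return True if `text` parses as Python source code."""
    try:
        # compile() only parses the text; it does not execute it.
        compile(text, "<unknown>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


print(looks_like_python("def f(x):\n    return x + 1\n"))
print(looks_like_python("public class A { int x; }"))
```

Note that short lines of plain text or CSV, such as `a,b,c`, also parse as valid Python, so a check of this kind can confirm a suspicion about a file but cannot identify its type on its own.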
Research Question: How to correctly identify the file type of a plain text file from its
contents?
To answer this question, we carried out an extensive literature survey to review existing
methodologies and approaches, and their adaptability from other fields. The survey of methods
and their usability is presented in section 2. Relevant algorithms adapted to our task are
explained in section 4. Given the nature of the problem, we framed it as a classification task
in supervised learning. A Python-based machine learning prototype was developed to understand
the intricacies of different classification models during the ‘proof of concept’ development
phase. The model construction, testing and evaluation are discussed in section 5.
2 Literature Review
Automated file type identification (AFTI) is a highly researched problem in digital forensics
and related fields. Researchers have concentrated on identifying image file types with
corrupted metadata and missing chunks of content. Methods were developed to reconstruct damaged
files from their fragments. File type identification using metadata is a widely accepted
paradigm for AFTI. It includes information about file extensions, header/footer signatures
[1, 2], and binary information such as magic bytes. All these methods work well when the
metadata is available and unaltered. However, these traditional approaches are not reliable
when the integrity of the metadata is not guaranteed. An alternative paradigm is to generate
‘fingerprints’ of file types from a set of known input files and to use the fingerprints to
classify an unknown file. Another prominent approach is to calculate the centroid of a given
file type from its salient features. Each unknown file is then classified by its distance from
the known set of centroids. The centroid paradigm uses supervised and unsupervised learning
techniques to infer a file (object) type classifier by exploiting unique inherent patterns that
describe a file type’s common file structure. Alamri et al. [3] have published a taxonomy of
file type identification. Their work covers over 30 different algorithms and approaches. In
this section, we review the literature on predicting the file type from fragments and on
content-based methods.
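The centroid idea can be sketched in a few lines. The example below is a minimal illustration, assuming each file has already been reduced to a numeric feature vector; the two toy features and all of the values are invented, and Euclidean distance is one possible choice of metric.

```python
import math


def centroid(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(samples):
    """Map each file type label to the centroid of its training vectors."""
    return {label: centroid(vectors) for label, vectors in samples.items()}


def predict(centroids, vector):
    """Return the label whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Invented toy features: [commas per line, semicolons per line]
training = {
    "csv": [[4.0, 0.0], [6.0, 0.1]],
    "java": [[0.5, 0.9], [0.3, 0.7]],
}
centroids = fit_centroids(training)
print(predict(centroids, [5.0, 0.0]))
```

An unknown file is assigned to whichever file type has the nearest centroid, so the quality of the result depends on how well the chosen features separate the types.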
2.1 File Type Identification from File Fragments
Identifying a file’s type from its fragments is mainly used as a recovery technique. It allows
a file to be recovered (or rebuilt) without contextual information or metadata. This process is
also referred to as ‘file carving’ in some of the literature. Image files are the main target
of this technique.
Calhoun et al. [4] investigated two algorithms for predicting the type of a fragment in
computer forensics. They performed experiments on fragments that do not contain header
information. The first algorithm was based on the linear discriminant and the second on the
longest common subsequences of fragments. Their work provided various relevant statistics,
such as byte frequency and entropy, as features to predict the file type. Ahmed et al. [5, 6]
also published two techniques to identify file types from file fragments. These techniques aim
to reduce the time taken to process the contents. Their first technique selects a subset of
features describing the frequency of occurrence of certain fragments. The second technique
speeds up classification by randomly sampling file blocks. They performed experiments on .png,
.jpg and .tiff file types. Poisel et al. [7, 8] published a comprehensive survey of file
carving research on detecting file types from their fragments. They also provided a file
carving ontology useful for researchers. In similar work, Evensen et al. [9] explored the use
of the naive Bayes classifier combined with n-gram analysis of byte sequences in files to
identify the file type. Gopal et al. [10] evaluated and analysed the robustness of Support
Vector Machine (SVM) and k-Nearest Neighbours (kNN) classifiers in handling damaged files and
file segments. They restricted their study to file type identification from the metadata. The
evaluation reveals that SVM and kNN outperform commercial off-the-shelf tools based on file
extensions. In his thesis, Wilgenbus [11] presented combined multi-layer perceptron neural
network and linear programming discriminant classifiers for the multi-class file fragment type
identification problem. This solution could help our text file format identification problem,
as neural networks learn from features of the contents and help in the classification of
discrete file types. Karampidis et al. [12, 13] examined a three-stage methodology for AFTI,
using feature extraction (Byte Frequency Distribution) and feature selection with a genetic
algorithm. They tested classification models including decision tree, SVM, neural network,
logistic regression and kNN. Their methodology showed that artificial neural networks achieved
very high accuracy in most cases.
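Two of the statistics mentioned above, byte frequency and entropy, are simple to compute. The following sketch shows one common way to do so; the function names are illustrative and are not taken from any of the cited works.

```python
import math
from collections import Counter


def byte_frequency(data):
    """Return a 256-element list of relative byte frequencies."""
    counts = Counter(data)
    total = len(data)
    return [counts.get(b, 0) / total if total else 0.0 for b in range(256)]


def shannon_entropy(data):
    """Shannon entropy of the byte distribution, in bits per byte."""
    return -sum(p * math.log2(p) for p in byte_frequency(data) if p > 0)


sample = b"name,age\nAnn,31\nBob,27\n"
print(shannon_entropy(sample))
```

The entropy ranges from 0 bits per byte for a fragment made of a single repeated byte to 8 bits per byte for a fragment in which all 256 byte values are equally frequent.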
the target file types have almost similar structures, for example Java and C programming
source code. We need to generate file features and classification models in such a way that
they describe file types distinctly.
Other improvements in this area include neural networks [17] and Byte Frequency Distribution
(BFD) to classify file types [18, 19]. Amirani et al. [20] proposed a content-based file type
detection method for files normalised using BFD. Their model uses principal component analysis
for feature selection. The selected features are then fed into an auto-associative unsupervised
neural network. Mitlöhner et al. [21] published a comprehensive study of the characteristics of
open data CSV files. Their work analyzes an open data corpus containing resources from a data
consumer perspective. Their study provided deep insight into feature engineering for the CSV
file type.
Predicting the file type from the contents of text files complicates the problem of AFTI.
Though several approaches are available, they are highly domain-specific. Hence we could not
use them for the identification of file types from their contents. We need to research generic
methods to fill this gap based on existing approaches.
3 Methodology
TNA deals with a huge variety of file types for digital archiving. Hence an iterative process
model is appropriate to include file features gradually. The methodology should be flexible to
add more file types progressively. As and when a new file type is to be included, its features
(specific characteristics) should be compared against the existing features of other file types
and engineered to add to the list. Hyperparameters for the models should be tuned to get a
better performance. The flow graph in Figure 4 shows the methodology developed and used by
us.
[Figure 4: Flow graph of the methodology: file corpus, external resources, feature extraction,
feature engineering, develop a classifier, test, hyperparameter modification.]
a) The File Corpus is the set of files that serve as the dataset to the file type identification
task. As TNA is set to receive digital documents from various government departments,
it was decided to use data files from the Github repositories of the Government Digital
Service. We have also collected file samples from TNA’s Github repositories to compile
the data file corpus for this prototype.
in various environments. For example, DROID, the binary file format identification tool, is
used to eliminate known file types as a first step.
c) Feature Extraction: The features of a file play an important role in identifying its type.
Feature extraction is the process of extracting features from the files in the corpus.
Characteristic features that determine the style and nature of the file type are extracted
during this step.
d) Feature Engineering is the process of using domain knowledge of the data to create
features that make machine learning algorithms work. It helps to fine tune the machine
learning models by reducing the computational processing overhead.
e) Classifier Development and Test: Machine learning (ML) is chosen to develop a classifier.
ML algorithms are used to understand and extract patterns from the data and help to predict
the outcome.
4 Data Pre-processing
Text file format identification is a non-linear learning problem, given the correlation between
the features of different file types. For example, the very high correlation between the Java
and Python programming file structures makes their features interdependent. Similarly, .csv
and .tsv files share file features. A comma-separated file (.csv) often contains unformatted
lines of text, leaving only a very small difference by which to tell the .txt and .csv formats
apart. The following phases in the life-cycle of the development of the model mirror these
facts.
• a Python source code file differs from a Java source code file in its commenting style,
its strict indentation requirement at the beginning of each line of code, its use of specific
keywords, etc.
• Java source code needs to follow a pre-defined structure to compile successfully.
• while statements in Java code must end with a ‘;’ (semicolon), Python does not need any.
• even though .csv and .tsv files are largely categorised as text-based, they can be
recognised by the number of commas (or other delimiter characters) they contain. Hence a
comparison of the number of commas between files could become a deciding factor in classifying
a file as .csv or .txt.
• in general, a .txt file has no restrictions on how it is created, unlike .csv or .py. It is
difficult to extract a pattern from a normal .txt file. Hence we have used the count of
‘stopwords’ (common words in English) from the NLTK library. We started with the assumption
that stopwords are used more in normal text files than in programming source code or data
files.
After feature engineering, 33 features were selected for classification. Features extracted
and used for classification are listed in the Appendix.
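A few of the features discussed above can be computed with the standard library alone. The sketch below is illustrative only: the feature names follow the spirit of the Appendix but use underscores, the small `STOPWORDS` set is a stand-in (an assumption) for the NLTK English stop word list, and the counting rules are simplified guesses rather than the prototype's actual definitions.

```python
import keyword

# A tiny stand-in for the NLTK English stop word list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def extract_features(text):
    """Compute a handful of content-based features for one file."""
    lines = text.splitlines() or [""]
    words = text.split()
    return {
        "num_lines": len(lines),
        "num_comma": text.count(","),
        "num_tab": text.count("\t"),
        "num_semicolon": text.count(";"),
        # Lines that start a Python function definition.
        "num_def": sum(1 for line in lines if line.lstrip().startswith("def ")),
        # Whitespace-separated tokens that are reserved words in Python.
        "num_python_keywords": sum(1 for w in words if keyword.iskeyword(w)),
        "num_stopwords": sum(
            1 for w in words if w.lower().strip(".,;:!?") in STOPWORDS
        ),
        "average_line_length": sum(len(line) for line in lines) / len(lines),
    }


print(extract_features("def add(a, b):\n    return a + b\n"))
```

Each file then becomes one row of numbers, which is the form a classifier expects. Words such as "in", "is" and "and" are both English stop words and Python keywords, which hints at why raw counts alone may not separate file types cleanly.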
⁴ https://github.com/nationalarchives/Text-File-Format-Identification
Results:
The evaluation of the decision tree classification is shown in the confusion matrix in Table 1.
The train-to-test ratio is set to 80:20 to achieve better accuracy. Though the classification
accuracy is very high, we give more weight to the precision metric, given the non-uniform
distribution of file types in the file corpus.
Table 1: Accuracy and precision metrics for Decision Tree classification model
Accuracy 98.58%
Precision score 86%
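The gap between accuracy and precision under an uneven class distribution can be reproduced with a toy confusion matrix. In the sketch below the counts are invented, and macro-averaging (an unweighted mean of per-class precision) is an assumption, since the averaging behind the reported precision score is not stated.

```python
def accuracy(matrix):
    """Fraction of samples on the diagonal (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_precision(matrix):
    """Unweighted mean over classes of TP / (TP + FP)."""
    scores = []
    for j in range(len(matrix)):
        predicted = sum(row[j] for row in matrix)  # column total for class j
        scores.append(matrix[j][j] / predicted if predicted else 0.0)
    return sum(scores) / len(scores)


# Invented counts: a majority class (row 0) and a rare class (row 1).
toy = [[980, 10],
       [5, 5]]
print(accuracy(toy), macro_precision(toy))
```

Here the majority class dominates the accuracy, while the rare class, for which only 5 of 15 predictions are correct, pulls the macro-averaged precision down considerably.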
Results:
The kNN classification evaluation is shown in a confusion matrix in Table 2. The train-to-test
ratio is set to 80:20 to achieve better accuracy. The ‘minkowski’ distance metric⁶ is used to
measure the distance between samples. The value of k is set to 3. Because of the uneven
distribution of file types in the file corpus, the precision score should be given more weight,
even though the accuracy is 94%.
Table 2: Accuracy and precision metrics for the k-nearest neighbour classification model
Accuracy 94.03%
Precision score 80%
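For readers unfamiliar with the method, the following is a minimal pure-Python sketch of kNN with a Minkowski distance and k = 3. The order p of the metric is not stated in the text, so p = 2 (Euclidean distance) is an assumption, and the training points are invented.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train, query, k=3, p=2):
    """Majority label among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented toy features: [commas per line, semicolons per line]
train = [
    ([5.0, 0.0], "csv"), ([6.0, 0.0], "csv"), ([4.0, 0.1], "csv"),
    ([0.5, 0.9], "java"), ([0.3, 0.8], "java"), ([0.4, 1.0], "java"),
]
print(knn_predict(train, [5.5, 0.0]))
```

With an odd k and two classes a tie cannot occur; with five classes, as in this study, a tie-breaking rule is needed, and this sketch simply takes the first label returned by `most_common`.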
of an input layer to receive the signal, an output layer that makes a decision or prediction about
the input, and between these two, an arbitrary number of hidden layers that are the true com-
putational engine. MLPs with one hidden layer are capable of approximating any continuous
function. They train on a set of input-output pairs and learn to model the correlation between
those inputs and outputs. Training involves adjusting the parameters, or the weights and biases,
of the model, in order to minimize error. Backpropagation is used to make those weight and
bias adjustments relative to the error, and the error itself can be measured in a variety of ways.
A neural network executes in two phases: Feed-forward and Backpropagation.
Feed-forward: These are the steps performed during the feed-forward phase:
1. The values received in the input layer are multiplied by the weights. A bias is added to
the weighted sum of the inputs.
2. Each neuron in the first hidden layer receives different values from the input layer,
depending upon the weights and bias. Neurons have an activation function that operates upon
the value received from the input layer.
3. The outputs from the first hidden layer of neurons are multiplied by the weights of the
second hidden layer; the results are summed together and passed to the neurons of the
succeeding layers. This process continues until the output layer is reached. The values
calculated at the output layer are the actual outputs of the algorithm.
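The three steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the layer widths (33, 12, 12, 5) are those reported for their MLP, but the ReLU and softmax activations, the random initial weights and the all-ones input vector are assumptions made for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus a bias, for every neuron in a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Activation used for the hidden layers in this sketch (assumption)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn output scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def feed_forward(features, layers):
    """Propagate one feature vector through the hidden layers to the output."""
    values = features
    for weights, biases in layers[:-1]:
        values = relu(dense(values, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(values, weights, biases))


rng = random.Random(0)
sizes = [33, 12, 12, 5]  # input, two hidden layers, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probabilities = feed_forward([1.0] * 33, layers)
print(probabilities)
```

The output is a list of five probabilities, one per file type, and the predicted class is the position of the largest value. With these layer widths the network has 629 trainable weights and biases.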
[Figure 5: The MLP model. Input files feed an input layer of 33 nodes, followed by two hidden
layers of 12 nodes each and an output layer of 5 nodes. The output classification result is
one-hot encoded, e.g. [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0], ...].]
Backpropagation: The predicted output is not necessarily correct right after the feed-forward
phase. To improve the predicted results, a neural network goes through a backpropagation
phase. During backpropagation, the weights of the neurons are updated so that the difference
between the desired and predicted output is as small as possible.
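A single backpropagation update can be written out by hand for a very small network. The sketch below assumes one hidden layer, sigmoid activations, a squared-error loss and no bias terms, all chosen to keep the example short; it is not the configuration of the MLP described in this paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Return the hidden activations and the single output activation."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One backpropagation update for the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output neuron (chain rule through the sigmoid).
    delta_out = (output - target) * output * (1.0 - output)
    # Error signals propagated back to each hidden neuron.
    delta_hidden = [delta_out * w * h * (1.0 - h)
                    for w, h in zip(w_out, hidden)]
    # Gradient descent: move every weight against its gradient.
    for j, h in enumerate(hidden):
        w_out[j] -= rate * delta_out * h
    for j, row in enumerate(w_hidden):
        for i, xi in enumerate(x):
            row[i] -= rate * delta_hidden[j] * xi
    return 0.5 * (output - target) ** 2


rng = random.Random(1)
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
w_out = [rng.uniform(-1, 1) for _ in range(4)]
losses = [train_step([1.0, 0.5, -0.5], 1.0, w_hidden, w_out)
          for _ in range(200)]
print(losses[0], losses[-1])
```

Repeating the update on the same example drives the output towards the target, so the last recorded loss is smaller than the first. Training on a real dataset repeats the same update over many examples.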
The MLP model generated for our classification problem was designed as a 3-layer fully
connected neural network with 33 nodes in the input layer, 12 nodes in each of the hidden
layers and 5 nodes (one for each output class) in the output layer. The number of nodes in each
layer was decided on a trial-and-error basis. Our MLP model is shown in Figure 5. The
parameters set for the MLP are given in Table 3.
Table 3: The values set for MLP parameters
Results:
The MLP neural network classification results are presented in a confusion matrix in Table 4.
The train-to-test ratio is set to 80:20 to achieve better accuracy. The train-test split is
kept the same across all classification methods. The test accuracy of 97.80% is almost as good
as that of the decision tree.
Table 4: Accuracy of the MLP neural network classification model
Acknowledgements
To Paul Young and Ian Henderson, for their deep insights and support at every stage.
References
[1] DROID. http://droid.sourceforge.net/, 2013.
[2] TrID. http://mark0.net/soft-trid-e.html.
[3] Nasser S. Alamri and William H. Allen. A taxonomy of file-type identification techniques. In Proceedings
of the 2014 ACM Southeast Regional Conference, ACM SE ’14, page 49:1–49:4, New York, NY, USA, 2014.
ACM.
[4] William C. Calhoun and Drue Coles. Predicting the types of file fragments. Digit. Investig., 5:S14–S20,
September 2008.
[5] Irfan Ahmed, Kyung suk Lhee, Hyunjung Shin, and ManPyo Hong. Content-based file-type identification
using cosine similarity and a divide-and-conquer approach. IETE Technical Review, 27(6):465, 2010.
[6] Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, and Man-Pyo Hong. Fast content-based file type identifica-
tion. In Advances in Digital Forensics VII, page 65–75. Springer Berlin Heidelberg, 2011.
[7] Rainer Poisel and Simon Tjoa. A comprehensive literature review of file carving. In 2013 International
Conference on Availability, Reliability and Security. IEEE, sep 2013.
[8] Rainer Poisel, Marlies Rybnicek, and Simon Tjoa. Taxonomy of data fragment classification techniques. In
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineer-
ing, page 67–85. Springer International Publishing, 2014.
[9] John Daniel Evensen, Sindre Lindahl, and Morten Goodwin. File-type detection using naive bayes and n-gram
analysis. In 2014: NISK 2014, 2014.
[10] Siddharth Gopal, Yiming Yang, Konstantin Salomatin, and Jaime Carbonell. Statistical learning for file-type
identification. In 2011 10th International Conference on Machine Learning and Applications and Workshops.
IEEE, dec 2011.
[11] Erich Feodor Wilgenbus. The file fragment classification problem : a combined neural network and linear
programming discriminant model approach. Master’s thesis, N, 2013.
[12] Konstantinos Karampidis, Ergina Kavallieratou, and George Papadourakis. Comparison of classification al-
gorithms for file type detection a digital forensics perspective. Polibits, 56:15–20, 2017.
[13] Konstantinos Karampidis and Giorgos Papadourakis. File type identification - computational intelligence for
digital forensics. The Journal of Digital Forensics, Security and Law, 2017.
[14] Mason McDaniel and M.Hossain Heydari. Content based file type detection algorithms. In 36th Annual
Hawaii International Conference on System Sciences, 2003. Proceedings of the. IEEE, 2003.
[15] M. McDaniel. Automatic file type detection algorithm. Master’s thesis, 2001.
[16] W. J. Li, S. J. Stolfo, and B. Herzog. Fileprints: identifying file types by n-gram analysis. In Proceedings
from the Sixth Annual IEEE SMC Information Assurance Workshop, page 64–71, June 2005.
[17] J. G. Dunham and J. C. R. Tseng. Classifying file type of stream ciphers in depth using neural networks.
In The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005., page 97–, Jan
2005.
[18] M. Karresand and N. Shahmehri. File type identification of data fragments by their binary structure. In 2006
IEEE Information Assurance Workshop, page 140–147, June 2006.
[19] L. Zhang and G. B. White. An approach to detect executable content for anomaly based network intrusion
detection. In 2007 IEEE International Parallel and Distributed Processing Symposium, page 1–8, March
2007.
[20] Mehdi Chehel Amirani, Mohsen Toorani, and Sara Mihandoost. Feature-based type identification of file
fragments. Security and Communication Networks, 6(1):115–128, apr 2012.
[21] J. Mitlöhner, S. Neumaier, J. Umbrich, and A. Polleres. Characteristics of open data csv files. In 2016 2nd
International Conference on Open and Big Data (OBD), page 72–79, Aug 2016.
[22] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, us ed
edition, May 2005.
[23] H. Ramchoun, M. A. Janati Idrissi, Y. Ghanou, and M. Ettaouil. Multilayer perceptron: Architecture
optimization and training with mixed activation functions. In Proceedings of the 2nd International Conference
on Big Data, Cloud and Applications, BDCA’17, page 71:1–71:6, New York, NY, USA, 2017. ACM.
[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.
Feature                 Description
file name               Name of the file along with its complete path
file extension          File extension, if available
num lines               Number of lines in the file, separated by the newline character
header info             File header information, if available
trailer info            Trailer information, if available
indentation             Number of spaces used for indentation (specific to Python)
eol marker              End-of-line markers, if any (specific to Java)
sol marker              Start-of-line markers, if any
isLowercaseMethods      Whether methods/functions start with a lower-case letter
num stopwords           Number of stop words used (specific to text files)
num Python keywords     Number of Python keywords within the file
num Java keywords       Number of Java keywords used in the file
Python comments         Number of Python-style comments
Java comments           Number of Java-style comments
angular brackets        Number of angular brackets used
curly brackets          Number of curly brackets used
round brackets          Number of round brackets used
square brackets         Number of square brackets used
num def                 Number of times ’def’ is used (specific to Python)
num returns             Number of times the keyword ’return’ is used
if else proximity       Number of words between if and else (specific to programming source code)
num carat               Number of times the caret symbol is used (specific to csv and tsv)
num comma               Number of times the comma symbol is used (specific to csv and tsv)
num fullstop            Number of times the full stop symbol is used (specific to csv and tsv)
num tab                 Number of times the tab is used (specific to csv and tsv)
num semicolon           Number of times the semicolon symbol is used (specific to csv and tsv)
num colon               Number of times the colon symbol is used (specific to csv and tsv)
num pipe                Number of times the pipe symbol is used (specific to csv and tsv)
num hash                Number of times the hash symbol is used (specific to csv and tsv)
average line length     Average length of a line (in characters)
description             Short file description, if available
programming             Whether the file is programming source code, if known
stopwords normalised    Normalised stop words across Java and Python
file type               File type information
A.2 Features Used
Of the above features, the following are engineered for classification.
’num lines’, ’header info’, ’trailer info’, ’indentation’, ’eol marker’, ’sol marker’, ’isLowercaseMethods’,
’Python comments’, ’Java comments’, ’num def’, ’num returns’, ’if else proximity’, ’num carat’,
’num comma’, ’num fullstop’, ’num tab’, ’num semicolon’, ’num colon’, ’num pipe’, ’num hash’, ’average line length’,
’file type’, ’programming’, ’stopwords normalised’, ’Python keywords’, ’Java keywords’, ’num Python comments normalised’,
’num Java comments normalised’,
’round normalised’, ’curly normalised’, ’square normalised’, ’angular normalised’, ’def return balance’