0% found this document useful (0 votes)

25 views

Computational Intelligence To Aid Text F

Uploaded by

rajivfr144

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

25 views

Computational Intelligence To Aid Text F

Uploaded by

rajivfr144

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Computational Intelligence to aid Text File Format

Identification
Santhilata Kuppili Venkata, Alex Green
The National Archives

Abstract
One of the challenges faced in digital preservation is to identify the file types when
the files can be opened with simple text editors and their extensions are unknown. The
problem gets complicated when the file passes through the test of human readability, but
would not make sense how to put to use! The Text File Format Identification (TFFI)
project was initiated at The National Archives to identify file types from plain text file
contents with the help of computing intelligence models. A methodology that takes help of
AI and machine learning to automate the process was successfully tested and implemented
on the test data. The prototype developed as a proof of concept has achieved up to 98.58%
of accuracy in detecting five file formats.

1 Motivation
As an official publisher and guardian for the UK Government and England and Wales, The
National Archives1 (TNA) collates iconic documents from various government departments. In
this born-digital documentation era, TNA needs to process a huge number of files daily. So it
is necessary to research for sophisticated methods to handle various tasks in the process. File
format identification of plain text files is one such example.

1.1 How a simple plain text file can create confusion?

Figure 1: A sample text file with no file extension

In this digital era, files are often generated in an integrated development environment. Each
document is supported by multiple files. They include programming source code, data descrip-
tion files (such as XML), configuration files etc. Contents of the supporting files are often
1
http://www.nationalarchives.gov.uk

1
human-readable. i.e they can be opened as plain text files using a simple text editor. But if the
file extensions are missed or corrupted, it is hard to know how to put the file to use!! A sample
file with missing extension is shown in the Fig. 1. The file shown here contains some charac-
ters from the Roman alphabet written in a column. At first glance, one would think that this
must be a simple exercise to typewriting characters in order. Someone familiar with the Unix
environment cannot rule out the possibility of this file being a part of a bash script/commands.
To a naive user, they appear to be a set of encrypted commands. Nevertheless, the question re-
mains, how can we make use of the file even though it does not make any sense at the moment?
If we have thousands of such files, it is impossible to examine each file physically, so we need
to automate the process of file type identification.

1.2 How Big is the Problem?

Figure 2: Some prominent file types found in the repositories

To estimate the enormity of the problem, we have cloned files from publicly available
Github repositories of the Government Digital Service (GDS)2 and The National Archives3 .
As TNA handles similar data, it makes sense to use these repositories as a testbed. In all, we
have cloned 1457 public repositories from these two sources. They contain over 410,000 files
representing 928 file types that can be opened with a simple text editor program. Fig. 2 shows
a partial bar graph of file type categories that contain at least 500 files of each format in our
sample dataset.

1.3 Our Starting Point

It is an extensive task to develop a single classifier model that classifies all 928 file types. So
we have grouped files into 14 categories to understand the priority of the file types that need
2
https://github.com/alphagov
3
https://github.com/nationalarchives

2
immediate attention. The distribution of 13 main file categories (in blue) and the corresponding
number of files (in red) for each category available in the corpus is shown in Fig. 3. The 14th
category is a collection of all file types that cannot be categorized as any of the remaining 13
file categories. From the above patterns and DROID [1] reports, we found that programming

Figure 3: Distribution of file types in the corpus

codes and text files are the major file types that TNA receives regularly. So for the initial
experimentation, we started with a small subset (22,292 files) consisting of five file types: two
programming source code file types (Java, Python), plain text files (.txt) and two delimiter
separated file types (.csv and .tsv).
Aside from digital archiving, file type identification is a serious problem in the areas of
digital forensics and cybersecurity. The research in digital forensics is mainly focused on the
identification of image file types and their metadata. While most of the research is targeted for
binary file formats, very little work is done in the plain text file category. Though our target file
types are different, we could adopt their approaches and methods to some extent.
There are mainly two approaches to work with plain text files. The first approach is to
treat the file as a plain text file (no prior knowledge about the file type) and search for spe-
cific characteristics for possible file types. The signature of the file type is a combination of
characteristics of that file type. This is a generic method and can be extended to any number
of file types. However, this approach needs a thorough knowledge of the file type to generate
its characteristical features. The second approach is based on prior knowledge about a file.
For example, if we have an intuition about a file belonging to a programming language, we
could validate the file type by running its compiler(s) or search for specific text patterns corre-

3
sponding programming language. We followed the first approach as our file corpus consists of
a variety of file types. We have developed a flexible methodology (described in section 3) to
reflect this approach. Our initial prototype makes use of machine learning algorithms and can
identify five formats: .py, .java, .txt, .csv, and .tsv.

1.4 Problem Statement

A digital document is often created with the help of several supporting files in an integrated
development environment. A corrupted or misidentified support file fails to reconstruct the
document correctly. Hence it is necessary to identify files correctly. Several tools are available
to identify file formats given the file extension and file’s signature (such as magic bytes). It is
difficult, however, to identify the type of a plain text file without any information. We need to
identify the file from its contents. With thousands of documents to handle, a manual inspection
of the contents of each file is tedious and time-consuming. We need to automate this process.
We can formulate the research question as follows:

Research Question: How to correctly identify the file type of a plain text file from its
contents?

To answer this question, we have carried out an extensive literature survey to review exist-
ing methodologies, approaches formulated and their adaptability from other fields. The survey
of methods and their usability is explained in section 2. Relevant algorithms reconstructed to
our task are explained in section 4. Given the nature of the problem, we narrowed our prob-
lem belonging to the classification category of supervised learning. A Python-based machine
learning prototype was developed to understand the intricacies of different classification mod-
els during the ‘proof of concept’ development phase. The model construction, testing and
evaluation are discussed in section 5.

2 Literature Review
Automated file type identification (AFTI) is a highly researched problem in digital forensics
and related fields. Researchers have concentrated on the identification of image file types with
corrupted metadata and missing chunks from the contents. Methods were developed to recon-
struct damaged files from their fragments. File type identification is using metadata is a widely
accepted paradigm for AFTI. It includes information about file extensions, header/footer sig-
natures [1, 2], binary information such as magic bytes etc. All these methods work well when
the metadata is available and unaltered. However, traditional approaches are not reliable when
the integrity of the metadata is not guaranteed. An alternative paradigm is to generate ‘finger-
prints’ of file types based on the set of known input files and using fingerprints to classify the
type of the unknown file. Another prominent approach is to calculate the centroid for a given
file type from its salient features. Each unknown file is examined for the distance from the
known set of centroids to predict the file type. On the other hand, the centroid paradigm uses
supervised and unsupervised learning techniques to infer a file (object) type classifier by ex-
ploiting unique inherent patterns that describe a file type’s common file structure. Alamri et al.
[3] have published a taxonomy of file type identification. Their work ranges over 30 different
algorithms and approaches. In this section, we review the literature related to predicting file
type from fragments and content-based methods.

4
2.1 File Type Identification from File Fragments
Identification of file type from its fragments is mainly used as a recovery technique. It allows
file recovery of the file (or rebuild) without contextual information or metadata. This process is
also referred to as ‘file carving’ in some of the literature. Image type files are mainly targeted
by this technique.
Calhoun et al. [4] investigated two algorithms for predicting the type from fragments in
computer forensics. They have performed experiments on the fragments that do not contain
header information. First algorithm was based on the linear discriminant and the second was
based on the longest common sub-sequences of fragments. Their work provided various rele-
vant statistics such as byte frequency, entropy, etc. as features to predict the file type. Ahmed
et al. [5, 6] also published two techniques to identify the file types from file fragments. These
techniques aim to reduce the time consumed to process the contents. Their first technique se-
lects a subset of features describing the frequency of occurrence of certain fragments. The
second technique speeds up classification by randomly sampling file blocks. They have per-
formed experiments on .png, .jpg and .tiff file types. Poisel et al. [7, 8] published a compre-
hensive survey of file carving research to detect the file types from their fragments. They have
also provided a file carving ontology useful for researchers. In a similar work, Evensen et al.
[9] explored the use of the naive Bayes classifier combined with n-gram analysis of byte se-
quences in files to correctly identify the file type. Gopal et al. [10] presented the evaluation and
analysis of the robustness of Support Vector Machine (SVM) and k-Nearest Neighbours (kNN)
in handling damaged files and file segments. They have restricted their study to the file type
identification from the metadata. The evaluation reveals that SVM and kNN learn better than
any commercial off-the-shelf tools developed based on file extensions. In his thesis, Wilgen-
bus [11] presented a combined multi-layer perceptron neural network and linear programming
discriminant classifiers for the multiple class file fragment type identification problems. This
solution could help our text file format identification problem, as neural networks learn from
features of the contents and helps in classification of discrete file types. In their work, Karam-
pidis et al. [12, 13] examine a three-stage methodology for AFTI, using feature selection (Byte
Frequency Distribution) and feature selection using genetic algorithm. They have tested with
classification models including decision tree, SVM, neural network, logistic regression and
kNN. Their methodology showed that artificial neural networks performed with a very high
and exceptional accuracy in most cases.

2.2 Content-based File Type Identification

Content-based file type detection methods have proved to be robust and more accurate so far.
They are built on the principle of extracting features from the files. Initial work on content-
based file type identification [14, 15] was based on three algorithms: byte frequency analysis,
byte frequency cross-correlation and File header/trailer analysis.
Li et al. [16] have provided improvements to these algorithms by generating file prints
(file signatures) using K-means algorithm with Manhattan distance metric. They produced
file prints with the help of the statistical features extracted and selected. The file prints are
also mentioned as ‘centroid’ in literature. An unknown file is tested against a set of known
centroids. The distance between the centroids is compared to predict the possible file type. The
Mahalanobis distance metric is deployed for the comparison. The file prints (centroids) are
developed using Natural Language Processing (NLP) techniques such as pattern matching of
n-gram contiguous sequence models. While their work is a pioneer in its kind, their approach
restricts the input file to follow a specific style only. Also, they fail to differentiate files when

5
the target file types have almost similar structures. For example, Java and C programming
source codes. We need to generate file features and classification models in such a way that
they describe file types distinctly.
Other improvements in this area include neural networks [17] and Byte Frequency Distri-
bution (BFD) to classify file types [18, 19]. Amirani et al. [20] proposed a content-based file
type detection method for files normalised using BFD. Their model uses principal component
analysis for feature selection. The model is then fed into an auto-associative unsupervised
neural network. Mitlohner et al. [21] published a comprehensive study of characteristics of
open data CSV files. Their work analyzes an open data corpus containing resources from a
data consumer perspective. Their study provided a deep insight to feature engineering CSV
file type.
Predicting the file type from the contents of text files complicates the problem of AFTI.
Though several approaches are available, they are highly domain-specific. Hence we could not
use them for the identification of file types from their contents. We need to research generic
methods to fill this gap based on existing approaches.

3 Methodology
TNA deals with a huge variety of file types for digital archiving. Hence an iterative process
model is appropriate to include file features gradually. The methodology should be flexible to
add more file types progressively. As and when a new file type is to be included, its features
(specific characteristics) should be compared against the existing features of other file types
and engineered to add to the list. Hyperparameters for the models should be tuned to get a
better performance. The flow graph in Figure 4 shows the methodology developed and used by
us.

Fil
e
c
o r
pus

Exter
nal Fe
ature Fe
atur
e De
velopa
r
esourc
es e
xtrac
tion e
ngi
neer
ing cl
ass
if
ier

Te
st
Hype
rpar
ame
termodi
fi
cat
ion

Figure 4: Methodology to include file types progressively

a) The File Corpus is the set of files that serve as the dataset to the file type identification
task. As TNA is set to receive digital documents from various government departments,
it was decided to use data files from the Github repositories of the Government Digital
Service. We have also collected file samples from TNA’s Github repositories to compile
the data file corpus for this prototype.

b) External resources comprise published methodologies, approaches and algorithms used

6
in various environments. For example, DROID - the binary file format identifier tool is
used to eliminate known file types as a first step.

c) Feature Extraction Features of a file play an important role in the identification of their
types. Feature extraction is the process of extracting features from the files in the cor-
pus. Characteristic features that determine the styles and nature of the file type will be
extracted during this step.

d) Feature Engineering is the process of using domain knowledge of the data to create
features that make machine learning algorithms work. It helps to fine tune the machine
learning models by reducing the computational processing overhead.

e) Classifier Development and Test Machine learning (ML) is chosen to develop a classifier.
ML algorithms are used to understand and extract the patterns from the data and help to
predict the outcome.

4 Data Pre-processing
Text file format identification is a non-linear learning problem given the correlation between
features of different file types. For example, a very high correlation between the Java and
Python programming file structures lead their features to be interdependent. Similarly, .csv
and .tsv files share their file features. Many times a comma separated files (.csv) may contain
unformatted textual lines, leaving a very small difference between .txt and .csv file formats to
differentiate. The following phases in the life-cycle of the development of the model mirror
these facts.

4.1 Feature Extraction

Our data consists of unstructured text files. So the first phase was to recognise features that
describe Python and Java source codes and .txt, .csv and .tsv files correctly. We automated
the process of feature extraction from each file to make the process uniform across all file
types. The feature extraction process scans through each token (word) in the file and creates a
quantifiable feature set. A total of 45 features were identified for extraction from each file.

4.2 Feature Engineering

Some of the facts that help to find relevant features during the feature engineering phase are as
given below.

• a Python source code file differs from a Java source code file by its commenting style,
strict indentation requirement at the beginning of each line of the code, use of specific
keywords etc.

• the Java programming source code needs to follow a pre-defined structure to be able to
compile successfully.

• while every line in the Java code needs to be ended with a ’;’ (semi-colon), python does
not need any.

• even though .csv and .tsv files are largely categorised as text-based, they can be recog-
nised by the use of the number of commas (or other delimiter characters). Hence a

7
comparison of the number of commas between files could become a deciding factor to
classify a file between .csv and .txt.

• in general, a .txt file got no restrictions on how to create one compared to .csv or .py.
It is difficult to extract a pattern from a normal .txt file. Hence we have used the count
of ‘stopwords’ (common words in English) from the NLTK library. We started with the
assumption that the usage of stopwords is more in normal text files than programming
codes or data files.

• another significant characteristic is the ’word-combination’ proximity. It differs from n-

gram contiguous model. For example, the combination of words such as <def-return>,
<if-then-else> etc. are used more closely in the programming codes than a .txt file. So,
we derived a threshold for the word-combination word sets.

After feature engineering, 33 features were selected for classification. Features extracted
and used for classification are listed in the Appendix.

4.3 Selection of Classification Models

There are four prominent categories of classification algorithms. They are (i) Linear models,
(ii) Tree-based algorithms, (iii) k-nearest algorithms and (iv) Neural network based. Since our
problem has discrete outputs and non-linear inputs, the linear models are omitted. All models
were trained and hyperparameters were tuned to improve the accuracy over many iterations. A
detailed explanation of the training and testing is explained in section 5.

5 Classification Models & Evaluation

The Jupyter notebooks developed as a proof of concept is available here4 .

5.1 The Decision Tree Classifier

A decision tree [22] is a flowchart-like tree structure where an internal node represents a feature
(or attribute). The branch represents a decision rule, and each leaf node represents the outcome.
Each parent node learns to partition the data based on the attribute value. It partitions the tree
recursively until all the data in the partition belongs to a single class. Decision trees can handle
high dimensional data with good accuracy. The decision tree is implemented with the help of
the Python scikit-learn library. The decision tree algorithm is explained in Algorithm 1.

Algorithm 1 Decision Tree Algorithm

1: Place the best attribute of our dataset at the root of the tree.
2: Split the training set into subsets. Subsets should be made in such a way that each subset
contains data with the same value for an attribute.
3: Repeat step 1 and step 2 on each subset until you find leaf nodes in all the branches of the
tree.

4
https://github.com/nationalarchives/Text-File-Format-Identification

8
Results:
The evaluation of the decision tree classification is shown in the confusion matrix in Table 1.
The train-to-test ratio is set ideally as 80:20 to achieve better accuracy. Though the accuracy
of classification is very high, we consider the precision metric more, given the non-uniform
distribution of file types in the file corpus.

Table 1: Accuracy and precision metrics for Decision Tree classification model

Accuracy 98.58%
Precision score 86%

5.2 The k-Nearest Neighbour Classifier

The k-Nearest Neighbour [22] (kNN) is based on feature similarity that determines how we
classify a given data point. The output is a class membership (predicts a class — a discrete
value). An object is classified by a majority vote of its neighbours, with the object being
assigned to the class most common among its k-nearest neighbours. The kNN algorithm5 is
explained in Algorithm 2.

Algorithm 2 k-Nearest Neighbour Algorithm

1: Load the data
2: Initialize k to be a chosen number of neighbours
3: For each data point a) Calculate the distance between the data point and the example in the
train data
4: Sort the calculated distances in ascending order and choose top k rows
5: Get the most frequent class as the predicted class

Results:
The kNN classification evaluation is shown in a confusion matrix in Table 2. The train-to-test
ratio is set ideally to 80:20 to achieve better accuracy. The ‘minkowski’ distance metric6 is
used to establish the distance between classes. The value for k is set to 3. Due to the uneven
distribution of the file types in the file corpus, though the accuracy is 94%, the precision score
should be considered more.
Table 2: Accuracy and precision metrics k-nearest neighbour classification model

Accuracy 94.03%
Precision score 80%

5.3 The Multilayer Perceptron based Classifier

A Multilayer perceptron (MLP) is a deep, artificial neural network. It is a non-linear classifica-
tion model. It is composed of more than one perceptron [23, 24]. Each perceptron is composed
5
https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/k-nearest-neighbor
6
https://www.sciencedirect.com/topics/computer-science/minkowski-distance

9
of an input layer to receive the signal, an output layer that makes a decision or prediction about
the input, and between these two, an arbitrary number of hidden layers that are the true com-
putational engine. MLPs with one hidden layer are capable of approximating any continuous
function. They train on a set of input-output pairs and learn to model the correlation between
those inputs and outputs. Training involves adjusting the parameters, or the weights and biases,
of the model, in order to minimize error. Backpropagation is used to make those weight and
bias adjustments relative to the error, and the error itself can be measured in a variety of ways.
A neural network executes in two phases: Feed-forward and Backpropagation.

Feed-forward These are the steps performed during the feed-forward phase:

1. The values received in the input layer are multiplied with the weights. A bias is added to
the summation of the inputs and weights.

2. Each neuron in the first hidden layer receives different values from the input layer de-
pending upon the weights and bias. Neurons have an activation function that operates
upon the value received from the input layer.

3. The outputs from the first hidden layer of neurons are multiplied with the weights of the
second hidden layer; the results are summed together and passed to the neurons of the
proceeding layers. This process continues until the outer layer is reached. The values
calculated at the outer layer are the actual outputs of the algorithm.

1 1 1 5node
s

2 2 2 1
I
nputf
il
es Out
putc
las
sif
ica
tionr
esul
t
3 3 3 2

[
[0,
0,
1,
0,
0],
[1,
0,
0,
0,
0],
….]
3

31 10 10 4

32 11 11 5

33 12 12

I
nputl
aye
r Hi
dde
nla
yer
s Out
putl
aye
r

Figure 5: Multilayer perceptron neural network used for classification

Back Propagation The predicted output is not necessarily correct right away after the feed-
forward phase. To improve these predicted results, a neural network will go through a back-
propagation phase. During backpropagation, the weights of neurons are updated in a way that
the difference between the desired and predicted output is as small as possible.
The MLP model generated for our classification problem was designed as a 3-layer fully
connected neural network with 33 nodes in the input layer, 12 nodes each in the hidden layers

10
and 5 nodes (one for each of the output class) in the output layer. The number of nodes in each
of the layers was decided on a trial and error basis. Our MLP model is shown in Figure 5. The
parameters set for the MLP is given in Table 3.
Table 3: The values set for MLP parameters

Parameter Value set to

activation (input layer/hidden layer 1) relu
activation (hidden layer 1/hidden layer 2) relu
activation (hidden layer 2/output layer) softmax
optimizer adam
loss type categorical crossentropy
metrics accuracy
# epochs 30
batch size 20

Results:
The MLP neural network classification results are presented in a confusion matrix in Table 4.
The train-to-test ratio is set ideally as 80:20 to achieve better accuracy. The train-test set is kept
the same across all classification methods. The test accuracy 97.80% is almost as good as the
decision tree.
Table 4: Accuracy MLP Neural network classification model

test - Accuracy 97.37%

test - loss 0.15
train - Accuracy 97.80%
train - loss 0.13

6 Conclusion & Scope

TNA has initiated the project:‘Text File Format Identification’ to identify file formats for plain
text files. A prototype was developed using computational intelligence models. The GitHub
repositories of the Government Digital Service and The National Archives were used for testing
and training purposes. This prototype has achieved 98.58% accuracy (with a decision tree
classifier model) in identifying five file formats. Python and Java programming code file types
were identified with higher accuracy compared to text file formats. This leads to the assumption
that the inherent structure of programming files has been captured better than .txt/.csv/.tsv files
with no fixed structure. In comparison, .tsv file identification has not fared well due to the low
representation of .tsv files in the input dataset. The decision tree model has performed better
than the other two models. We have successfully established a methodology which is flexible
enough to work with more file formats in future.
In future, we could focus on revising the dominant feature identification for .csv and .tsv
file types. In this project, we assumed that each .csv file contained only one table. However,
it is possible that multiple tables exist within a single .csv file. This issue should be investi-
gated. Even though the current prototype works well for the five file types, a revision of feature
engineering will be necessary whenever a new file type is included.

11
Acknowledgements
To Paul Young and Ian Henderson, for their deep insights and support at every stage.

References
[1] DROID. http://droid.sourceforge.net/, 2013.
[2] TrID. http://mark0.net/soft-trid-e.html.
[3] Nasser S. Alamri and William H. Allen. A taxonomy of file-type identification techniques. In Proceedings
of the 2014 ACM Southeast Regional Conference, ACM SE ’14, page 49:1–49:4, New York, NY, USA, 2014.
ACM.
[4] William C. Calhoun and Drue Coles. Predicting the types of file fragments. Digit. Investig., 5:S14–S20,
September 2008.
[5] Irfan Ahmed, Kyung suk Lhee, Hyunjung Shin, and ManPyo Hong. Content-based file-type identification
using cosine similarity and a divide-and-conquer approach. IETE Technical Review, 27(6):465, 2010.
[6] Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, and Man-Pyo Hong. Fast content-based file type identifica-
tion. In Advances in Digital Forensics VII, page 65–75. Springer Berlin Heidelberg, 2011.
[7] Rainer Poisel and Simon Tjoa. A comprehensive literature review of file carving. In 2013 International
Conference on Availability, Reliability and Security. IEEE, sep 2013.
[8] Rainer Poisel, Marlies Rybnicek, and Simon Tjoa. Taxonomy of data fragment classification techniques. In
Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineer-
ing, page 67–85. Springer International Publishing, 2014.
[9] John Daniel Evensen, Sindre Lindahl, and Morten Goodwin. File-type detection using naive bayes and n-gram
analysis. In 2014: NISK 2014, 2014.
[10] Siddharth Gopal, Yiming Yang, Konstantin Salomatin, and Jaime Carbonell. Statistical learning for file-type
identification. In 2011 10th International Conference on Machine Learning and Applications and Workshops.
IEEE, dec 2011.
[11] Erich Feodor Wilgenbus. The file fragment classification problem : a combined neural network and linear
programming discriminant model approach. Master’s thesis, N, 2013.
[12] Konstantinos Karampidis, Ergina Kavallieratou, and George Papadourakis. Comparison of classification al-
gorithms for file type detection a digital forensics perspective. Polibits, 56:15–20, 2017.
[13] Konstantinos Karampidis and Giorgos Papadourakis. File type identification - computational intelligence for
digital forensics. The Journal of Digital Forensics, Security and Law, 2017.
[14] Mason McDaniel and M.Hossain Heydari. Content based file type detection algorithms. In 36th Annual
Hawaii International Conference on System Sciences, 2003. Proceedings of the. IEEE, 2003.
[15] M. McDaniel. Automatic file type detection algorithm. Master’s thesis, 2001.
[16] W. J. Li, S. J. Stolfo, and B. Herzog. Fileprints: identifying file types by n-gram analysis. In Proceedings
from the Sixth Annual IEEE SMC Information Assurance Workshop, page 64–71, June 2005.
[17] J. G. Dunham and J. C. R. Tseng. Classifying file type of stream ciphers in depth using neural networks.
In The 3rd ACS/IEEE International Conference onComputer Systems and Applications, 2005., page 97–, Jan
2005.
[18] M. Karresand and N. Shahmehri. File type identification of data fragments by their binary structure. In 2006
IEEE Information Assurance Workshop, page 140–147, June 2006.
[19] L. Zhang and G. B. White. An approach to detect executable content for anomaly based network intrusion
detection. In 2007 IEEE International Parallel and Distributed Processing Symposium, page 1–8, March
2007.
[20] Mehdi Chehel Amirani, Mohsen Toorani, and Sara Mihandoost. Feature-based type identification of file
fragments. Security and Communication Networks, 6(1):115–128, apr 2012.
[21] J. Mitlöhner, S. Neumaier, J. Umbrich, and A. Polleres. Characteristics of open data csv files. In 2016 2nd
International Conference on Open and Big Data (OBD), page 72–79, Aug 2016.
[22] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison Wesley, us ed
edition, May 2005.

12
[23] H. Ramchoun, M. A. Janati Idrissi, Y. Ghanou, and M. Ettaouil. Multilayer perceptron: Architecture opti-
mization and training with mixed activation functions. In Proceedings of the 2Nd International Conference
on Big Data, Cloud and Applications, BDCA’17, page 71:1–71:6, New York, NY, USA, 2017. ACM.
[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.

A Features generated from files

A.1 Original features generated

Table 5: Features extracted

Feature Description
file name Name of the file along with its complete path
file extension File extension if available
num lines Number of lines in the file separated by newline character
header info File header information if available
trailer info Trailer information, if available
indentation Number of spaces used for indentation (specific to Python)
eol marker End-of-line markers, if any (specific to Java)
sol marker Start-of-line markers, if any
isLowercaseMethods Whether methods/functions start with lower case alphabets
num stopwords Number of stop words used (specific to text files)
num Python keywords Number of Python key words within the file
num Java keywords Number of Java key words used in the file
Python comments Number of Python style of comments
Java comments Number of Java style of comments
angular brackets Number of angular brackets used
curly brackets Number of curly brackets used
round brackets Number of round brackets used
square brackets Number of square brackets used
num def Number of ’def’ used (specific to Python
num returns Number of times the key word ’return’ used
if else proximity Number of words between if and else (specific to programming codes)
num carat Number of times the carat symbol used (specific to csv and tsv)
num comma Number of times the comma symbol used (specific to csv and tsv)
num fullstop Number of times the fullstop symbol used (specific to csv and tsv)
num tab Number of times the tab used (specific to csv and tsv)
num semicolon Number of times the semi colon symbol used (specific to csv and tsv)
num colon Number of times the colon symbol used (specific to csv and tsv)
num pipe Number of times the pipe symbol used (specific to csv and tsv)
num hash Number of times the hash symbol used (specific to csv and tsv)
average line length Average length of a line (in characters)
description File description in short, if available
programming Whether the file is a programming code, if known
stopwords normalised Normalised stop words across Java and Python
file type File Type information

13
A.2 Features Used
Out of the above features, following features are engineered for classification.
’num lines’, ’header info’, ’trailer info’, ’indentation’, ’eol marker’, ’sol marker’, ’isLowercaseMethods’,
’Python comments’, ’Java comments’, ’num def’, ’num returns’, ’if else proximity’, ’num carat’,
’num comma’, ’num fullstop’, ’num tab’, ’num semicolon’, ’num colon’, ’num pipe’, ’num hash’, ’average line length’,
’file type’, ’programming’, ’stopwords normalised’, ’Python keywords’, ’Java keywords’, ’num Python comments normalised’,
’num Java comments normalised’,
’round normalised’, ’curly normalised’, ’square normalised’, ’angular normalised’, ’def return balance’

Escalation Matrix - Facility Team (TMS) : Facility Helpdesk (Regular Business)
100% (1)
Escalation Matrix - Facility Team (TMS) : Facility Helpdesk (Regular Business)
1 page
C++ File Handling Step by Step: A Practical Guide with Examples
From Everand
C++ File Handling Step by Step: A Practical Guide with Examples
William E. Clark
No ratings yet
Content Based File Type Detection Algorithms
No ratings yet
Content Based File Type Detection Algorithms
10 pages
Python File Handling Made Easy: A Practical Guide with Examples
From Everand
Python File Handling Made Easy: A Practical Guide with Examples
William E. Clark
No ratings yet
Content Based File Type Detection Algorithms
No ratings yet
Content Based File Type Detection Algorithms
11 pages
Software Needs
No ratings yet
Software Needs
3 pages
File Formats- Characterization and Validation
No ratings yet
File Formats- Characterization and Validation
6 pages
C.1 FileTypeIdentification ALiteratureReview
No ratings yet
C.1 FileTypeIdentification ALiteratureReview
6 pages
Python For Data Science
From Everand
Python For Data Science
Kevin Clark
No ratings yet
Java File Handling Step by Step: A Practical Guide with Examples
From Everand
Java File Handling Step by Step: A Practical Guide with Examples
William E. Clark
No ratings yet
C.1 FileTypeIdentification ALiteratureReview
No ratings yet
C.1 FileTypeIdentification ALiteratureReview
6 pages
File Type Detection Technology
No ratings yet
File Type Detection Technology
12 pages
Config File Types
From Everand
Config File Types
Frank Wellington
No ratings yet
Automatic Image Annotation: Fundamentals and Applications
From Everand
Automatic Image Annotation: Fundamentals and Applications
Fouad Sabry
No ratings yet
Concept Mining: Fundamentals and Applications
From Everand
Concept Mining: Fundamentals and Applications
Fouad Sabry
No ratings yet
Beginning XML
From Everand
Beginning XML
Joe Fawcett
3/5 (1)
Image Retrieval: Fundamentals and Applications
From Everand
Image Retrieval: Fundamentals and Applications
Fouad Sabry
No ratings yet
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
From Everand
Automatic Image Annotation: Enhancing Visual Understanding through Automated Tagging
Fouad Sabry
No ratings yet
Fileprints Identifying File Types by n-gram Analysis
No ratings yet
Fileprints Identifying File Types by n-gram Analysis
8 pages
Information Extraction: Fundamentals and Applications
From Everand
Information Extraction: Fundamentals and Applications
Fouad Sabry
No ratings yet
Content Based Automated File Organization Using Ma
No ratings yet
Content Based Automated File Organization Using Ma
16 pages
File Handling 1
No ratings yet
File Handling 1
6 pages
The Science of Managing Our Digital Stuff
From Everand
The Science of Managing Our Digital Stuff
Ofer Bergman
3.5/5 (3)
Knowledge Reasoning: Fundamentals and Applications
From Everand
Knowledge Reasoning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Image Retrieval: Unlocking the Power of Visual Data
From Everand
Image Retrieval: Unlocking the Power of Visual Data
Fouad Sabry
No ratings yet
INI Format Explained
From Everand
INI Format Explained
Isabella Ramirez
No ratings yet
Class Xii File Handling
No ratings yet
Class Xii File Handling
14 pages
Learn Python in One Hour: Programming by Example
From Everand
Learn Python in One Hour: Programming by Example
Victor R. Volkman
3/5 (2)
11Mar_Mayer
No ratings yet
11Mar_Mayer
106 pages
Mastering Python in 7 Days
From Everand
Mastering Python in 7 Days
Alex Wood
No ratings yet
Identifying Unnecessary Files
No ratings yet
Identifying Unnecessary Files
16 pages
Managing Multimedia and Unstructured Data in the Oracle Database
From Everand
Managing Multimedia and Unstructured Data in the Oracle Database
Marcelle Kratochvil
No ratings yet
What Is A File?
No ratings yet
What Is A File?
10 pages
Pronom File Signature Research
No ratings yet
Pronom File Signature Research
19 pages
Python Data Persistence
From Everand
Python Data Persistence
Malhar Lathkar
No ratings yet
Relationship Extraction: Fundamentals and Applications
From Everand
Relationship Extraction: Fundamentals and Applications
Fouad Sabry
No ratings yet
Mastering Computer Programming: A Comprehensive Guide
From Everand
Mastering Computer Programming: A Comprehensive Guide
Kondwani Hara
No ratings yet
XML Programming: The Ultimate Guide to Fast, Easy, and Efficient Learning of XML Programming
From Everand
XML Programming: The Ultimate Guide to Fast, Easy, and Efficient Learning of XML Programming
Christopher Right
2.5/5 (2)
Short Text Classification Approach To Identify Child Sexual Exploitation Material
No ratings yet
Short Text Classification Approach To Identify Child Sexual Exploitation Material
9 pages
Best Free Open Source Data Recovery Apps for Mac OS English Edition
From Everand
Best Free Open Source Data Recovery Apps for Mac OS English Edition
Cyber Jannah Sakura
No ratings yet
File Format: 2 Patents
No ratings yet
File Format: 2 Patents
7 pages
Unit 4
No ratings yet
Unit 4
24 pages
F 12 CH 04 TEXT FILE HANDLING 1
No ratings yet
F 12 CH 04 TEXT FILE HANDLING 1
111 pages
Data Structures I Essentials
From Everand
Data Structures I Essentials
Dennis Smolarski
No ratings yet
Data Type in Python
No ratings yet
Data Type in Python
20 pages
Touchpad Play Ver 2.0 Class 7: Windows 10 & MS Office 2016
From Everand
Touchpad Play Ver 2.0 Class 7: Windows 10 & MS Office 2016
Team Orange
No ratings yet
Presentation On Use of Files. Programming
No ratings yet
Presentation On Use of Files. Programming
46 pages
Better Unpacking Binary Files Using Contextual Information
No ratings yet
Better Unpacking Binary Files Using Contextual Information
5 pages
Artificial Intelligence Frame: Fundamentals and Applications
From Everand
Artificial Intelligence Frame: Fundamentals and Applications
Fouad Sabry
No ratings yet
Lecs 102
No ratings yet
Lecs 102
20 pages
Thuyet Trinh
No ratings yet
Thuyet Trinh
28 pages
Basic Principles of an Operating System: Learn the Internals and Design Principles
From Everand
Basic Principles of an Operating System: Learn the Internals and Design Principles
Priyanka Rathee
No ratings yet
Importing data in python
No ratings yet
Importing data in python
13 pages
GE3151 PS and Python Programming - UNIT V
No ratings yet
GE3151 PS and Python Programming - UNIT V
28 pages
File Handling3
No ratings yet
File Handling3
38 pages
Data-Driven Security: Analysis, Visualization and Dashboards
From Everand
Data-Driven Security: Analysis, Visualization and Dashboards
Jay Jacobs
No ratings yet
lecs102
No ratings yet
lecs102
21 pages
Python Data File Handling XII CS 2022-23 As On 28-10-2022
No ratings yet
Python Data File Handling XII CS 2022-23 As On 28-10-2022
62 pages
The Future of .txt Files
No ratings yet
The Future of .txt Files
3 pages
Chapter 4 File Handlinf Final (New)
No ratings yet
Chapter 4 File Handlinf Final (New)
78 pages
Lecs 102
No ratings yet
Lecs 102
20 pages
Flow Rate
No ratings yet
Flow Rate
3 pages
Alu Life Cycle 1
No ratings yet
Alu Life Cycle 1
66 pages
Artificial Intelligence's Relevance For Energy Optimization, Companies and Business Internationalization
No ratings yet
Artificial Intelligence's Relevance For Energy Optimization, Companies and Business Internationalization
9 pages
Irjet V10i234
No ratings yet
Irjet V10i234
21 pages
Life Cycle Assessment
No ratings yet
Life Cycle Assessment
28 pages
COVID 19 FAQs
No ratings yet
COVID 19 FAQs
20 pages
Food Business Continuity Management
No ratings yet
Food Business Continuity Management
6 pages
Managing Human Error
No ratings yet
Managing Human Error
11 pages
Organisation (Name) Department (Name)
No ratings yet
Organisation (Name) Department (Name)
25 pages
Manpower Needs Request Form
No ratings yet
Manpower Needs Request Form
1 page
Session 8: Measuring and Communicating On Sustainable Supply Chain Performance
No ratings yet
Session 8: Measuring and Communicating On Sustainable Supply Chain Performance
47 pages
New Employees Induction Training Form
No ratings yet
New Employees Induction Training Form
1 page
Used Cooking Oil
No ratings yet
Used Cooking Oil
2 pages
Business Card Request
No ratings yet
Business Card Request
1 page
Iso 14001 2015
No ratings yet
Iso 14001 2015
6 pages
ISO 45001 Checklist
100% (1)
ISO 45001 Checklist
5 pages
Risk MGMT Flyer COLOR For Asq
No ratings yet
Risk MGMT Flyer COLOR For Asq
1 page
Deep - Learning - With - Edge - Computing - A - Review 2023
No ratings yet
Deep - Learning - With - Edge - Computing - A - Review 2023
20 pages
T 2V: D R T: OP EC Istributed Epresentations of Opics
No ratings yet
T 2V: D R T: OP EC Istributed Epresentations of Opics
25 pages
Summary of Ids-Ips Research Papers
No ratings yet
Summary of Ids-Ips Research Papers
73 pages
Fraud Detection Model Thesis
No ratings yet
Fraud Detection Model Thesis
217 pages
Deep4SNet: Deep Learning For Fake Speech Classification
No ratings yet
Deep4SNet: Deep Learning For Fake Speech Classification
12 pages
Forecasting Exchange Rates Using General Regression Neural Networks
No ratings yet
Forecasting Exchange Rates Using General Regression Neural Networks
18 pages
Federated Deep Learning For Monkeypox Disease Detection On GAN-Augmented Dataset
No ratings yet
Federated Deep Learning For Monkeypox Disease Detection On GAN-Augmented Dataset
11 pages
XOR Problem Demonstration Using MATLAB
0% (1)
XOR Problem Demonstration Using MATLAB
19 pages
MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE IN BA
No ratings yet
MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE IN BA
87 pages
Machine Learning and Real-World Applications
100% (1)
Machine Learning and Real-World Applications
19 pages
Business Intelligence and Decision Support Systems (9 Ed., Prentice Hall)
No ratings yet
Business Intelligence and Decision Support Systems (9 Ed., Prentice Hall)
41 pages
Deep Learning-Based Strategies For Integrated Autonomous Navigation A Review
No ratings yet
Deep Learning-Based Strategies For Integrated Autonomous Navigation A Review
6 pages
Bahan Ajar Pemodelan Dan Identifikasi Sistem PDF
No ratings yet
Bahan Ajar Pemodelan Dan Identifikasi Sistem PDF
5 pages
Neural Network
No ratings yet
Neural Network
37 pages
ML Lab Manual (final) dtu
No ratings yet
ML Lab Manual (final) dtu
52 pages
A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends
No ratings yet
A Survey On Self-Supervised Learning: Algorithms, Applications, and Future Trends
20 pages
L12_intro-cnn-part1__slides
No ratings yet
L12_intro-cnn-part1__slides
56 pages
AI PERA
No ratings yet
AI PERA
10 pages
Full Stuck Software Developer
No ratings yet
Full Stuck Software Developer
45 pages
MLT unit 4 (1)
No ratings yet
MLT unit 4 (1)
15 pages
Module 5
No ratings yet
Module 5
72 pages
EC 8th SEM AICTE121220115934
No ratings yet
EC 8th SEM AICTE121220115934
15 pages
Lecture 8- Artificial Neural Networks
No ratings yet
Lecture 8- Artificial Neural Networks
41 pages
End-To-End Learning of Driving Models From Large-Scale Video Datasets
No ratings yet
End-To-End Learning of Driving Models From Large-Scale Video Datasets
9 pages
Stock Price Prediction Using LSTM
0% (1)
Stock Price Prediction Using LSTM
29 pages
BSC Fully Distributed Representation Kanerva 1997
No ratings yet
BSC Fully Distributed Representation Kanerva 1997
8 pages
2.15 Model-Free Adaptive (MFA) Control: G. S. Cheng
No ratings yet
2.15 Model-Free Adaptive (MFA) Control: G. S. Cheng
10 pages
Chatbot Research Paper
No ratings yet
Chatbot Research Paper
5 pages
REPORT -MINOR PROJECT
No ratings yet
REPORT -MINOR PROJECT
22 pages
SSRN 3335536
No ratings yet
SSRN 3335536
35 pages

Computational Intelligence To Aid Text F

Uploaded by

Computational Intelligence To Aid Text F

Uploaded by

Computational Intelligence to aid Text File Format

1.1 How a simple plain text file can create confusion?

Figure 1: A sample text file with no file extension

1.2 How Big is the Problem?

Figure 2: Some prominent file types found in the repositories

1.3 Our Starting Point

Figure 3: Distribution of file types in the corpus

1.4 Problem Statement

2.2 Content-based File Type Identification

Figure 4: Methodology to include file types progressively

b) External resources comprise published methodologies, approaches and algorithms used

4.1 Feature Extraction

4.2 Feature Engineering

• another significant characteristic is the ’word-combination’ proximity. It differs from n-

4.3 Selection of Classification Models

5 Classification Models & Evaluation

5.1 The Decision Tree Classifier

Algorithm 1 Decision Tree Algorithm

5.2 The k-Nearest Neighbour Classifier

Algorithm 2 k-Nearest Neighbour Algorithm

5.3 The Multilayer Perceptron based Classifier

Figure 5: Multilayer perceptron neural network used for classification

Parameter Value set to

test - Accuracy 97.37%

6 Conclusion & Scope

A Features generated from files

Table 5: Features extracted

You might also like