Research Paper 5 2004
Daniel Ellard, Michael Mesnier, Eno Thereska, Gregory R. Ganger, Margo Seltzer
Abstract

We present evidence that attributes that are known to the file system when a file is created, such as its name, permission mode, and owner, are often strongly related to future properties of the file such as its ultimate size, lifespan, and access pattern. More importantly, we show that we can exploit these relationships to automatically generate predictive models for these properties, and that these predictions are sufficiently accurate to enable optimizations.

(Author affiliation footnote: Intel Corporation and Parallel Data Laboratory, Carnegie Mellon University.)

Figure 1: Using file attributes to predict file properties. During the training period a predictor for file properties (i.e., lifespan, size, and access pattern) is constructed from observations of file system activity. The file system can then use this model to predict the properties of newly-created files. (The figure, not reproduced here, shows create-time attributes such as .#pico, 644, and a uid feeding a model generator driven by FS activity feedback (NFS, local), which emits predictions of the wanted properties.)

1 Introduction

In "Hints for Computer System Design," Lampson tells us to "Use hints to speed up normal execution." [14] The file system community has rediscovered this principle a number of times, suggesting that hints about a file's access pattern, size, and lifespan can aid in a variety of ways, including improving the file's layout on disk and increasing the effectiveness of prefetching and caching. Unfortunately, earlier hint-based schemes have required the application designer or programmer to supply explicit hints using a process that is both tedious and error-prone, or to use a special compiler that can recognize specific I/O patterns and automatically insert hints. Neither of these schemes has been widely adopted.

In this paper, we show that applications already give useful hints to the file system, in the form of file names and other attributes, and that the file system can successfully predict many file properties from these hints.

We begin by presenting statistical evidence from three contemporary NFS traces that many file attributes, such as the file name, user, group, and mode, are strongly related to file properties including file size, lifespan, and access patterns. We then present a method for automatically constructing tree-based predictors for the properties of a file based on these attributes, and show that these predictions are accurate. Finally, we discuss uses for such predictions, including an implementation of a system that uses them to improve file layout by anticipating which blocks will be the most frequently accessed and grouping these blocks in a small area on the disk, thereby improving reference locality.

The rest of this paper is organized as follows: Section 2 discusses related work. Section 3 describes the collection of NFS traces we analyze in this study. Section 4 makes the case for attribute-based predictions by presenting a statistical analysis of the relationship between attributes of files and their properties. Section 5 presents ABLE, a classification-tree-based predictor for several file properties based on their attributes. Section 6 discusses how such models might be used, and demonstrates an example application which increases the locality of reference for on-disk block layout. Section 7 concludes.

2 Related Work

As the gap between CPU and I/O performance has increased, many efforts have attempted to address it. An entire industry and research community has emerged to
attack I/O performance; file systems have been modified, rewritten, and rethought in attempts to reduce the number of synchronous disk requests. Significant effort has also been expended to make caches more effective so that the number of disk requests can be reduced. Many powerful heuristics have been discovered, often from the analyses of real workloads, and incorporated into production file systems. All of these endeavors have been productive, but I/O performance is still losing ground to CPU, memory, and network performance, and we have not resolved the I/O crisis to which Patterson refers in the original RAID paper, written more than fifteen years ago [24].

There is extensive ongoing research in the file system and database communities regarding the optimization of various aspects of performance, reliability, and availability of data access. Many heuristics have been developed and incorporated into popular file systems like the Fast File System (FFS) [17]. Many of these heuristics depend on assumptions about workloads and file properties.

One example of a contemporary file system is the Fast File System (FFS) [17], whose basic design is nearly twenty years old and yet continues to be tuned [7]. For example, FFS is optimized to handle small files in a different manner than large files; it attempts to organize small files on disk so that they are near their metadata and other files in the directory, under the assumption that files in the same directory are often accessed together. Some file systems go to more extreme lengths, such as storing the contents of short files in the same disk block as their inode [22], or storing the directory and inode information in the same block [11].

In addition to size, other properties of files, such as whether they are write-mostly or read-mostly, have been found useful to drive various file system policies. For example, the assumption underlying the design of the log-structured file system (LFS) is that write latency is the bottleneck for file system performance [26]. Hybrid schemes that use LFS to store write-mostly files have also found this approach useful [23]. In contrast, if a file is known to be read-mostly, it may benefit from aggressive replication for increased performance and availability [27].

Unfortunately, every widespread heuristic approach suffers from at least one of the following problems: first, if the heuristics are wrong, they may cause performance to degrade; and second, if the heuristics are dynamic, they may take considerable time, computation, and storage space to adapt to the current workload (and if the workload varies over time, the adaptation might never converge).

One partial solution to the problem of inappropriate or incomplete file system heuristics is for applications to supply hints to the file system about the files' anticipated access patterns. In some contexts these hints can be extremely successful, especially when used to guide the policies for prefetching and selective caching [5, 25]. The drawback of this approach is that it requires that applications be modified to provide hints. There has been work in having the compiler automatically generate hints, but success in this area has been largely confined to scientific workloads with highly regular access patterns [21], and no file system that uses these ideas has been widely deployed.

In previous work, we noted that for some workloads, applications (and the users of the applications) already provide hints about the future of the files that they create via the names they choose for those files [8]. In this paper we generalize this finding and show that file names, as well as other attributes such as uid and mode, are, in fact, hints that may be useful to the file system.

In addition to static analyses of workloads, there has been research aimed at understanding the dynamic behaviors of files. Previous work has shown that properties of files depend on the applications and users accessing them [2, 9], and because users and applications may change, workloads change as well.

Considerable work has been done in developing and exploiting predictive models for the access patterns of files (or their data blocks) [1, 30, 31]. Most work in this area focuses on rather complex and computationally expensive predictive models. Furthermore, such models are often needed on a file-by-file basis and do not attempt to find relationships or classes among files to generalize [15]. We extend this work by providing a framework for automatically classifying files with similar behaviors.

There also exist systems that use a variety of layout policies that provide non-uniform access characteristics. In the most extreme case, a system like AutoRAID [31] employs several different methods to store blocks with different characteristics. On a more mundane level, the performance of nearly all modern disk drives is highly influenced by the multi-zone effect, which can cause the effective transfer rate for the outer tracks of a disk to be considerably higher than that of the inner tracks [19]. There is ample evidence that adaptive block layout can improve performance; we will demonstrate that we can preemptively determine the layout heuristics to achieve this benefit without having to reorganize files after their initial placement.

Advances in artificial intelligence and machine learning have resulted in efficient algorithms for building accurate predictive models that can be used in today's file
systems. We leverage this work and utilize a form of classification tree to capture the relationships between file attributes and their behaviors, as further described in Section 5.

The work we present here does not focus on new heuristics or policies for optimizing the file system. Instead, it enables a file system to choose the proper policies to apply by predicting whether or not the assumptions on which these policies rely will hold for a particular file.

3 The Traces

To demonstrate that our findings are not confined to a single workload, system, or set of users, we analyze traces taken from three servers:

DEAS03 traces a Network Appliance Filer that serves the home directories for professors, graduate students, and staff of the Harvard University Division of Engineering and Applied Sciences. This trace captures a mix of research and development, administrative, and email traffic. The DEAS03 trace begins at midnight on 2/17/2003 and ends on 3/2/2003.

EECS03 traces a Network Appliance Filer that serves the home directories for some of the professors, graduate students, and staff of the Electrical Engineering and Computer Science department of the Harvard University Division of Engineering and Applied Sciences. This trace captures the canonical engineering workstation workload. The EECS03 trace begins at midnight on 2/17/2003 and ends on 3/2/2003.

CAMPUS traces one of 14 file systems that hold home directories for the Harvard College and Harvard Graduate School of Arts and Sciences (GSAS) students and staff. The CAMPUS workload is almost entirely email. The CAMPUS trace begins at midnight on 10/15/2001 and ends on 10/28/2001.

Ideally our analyses would include NFS traces from a variety of workloads including commercial datacenter servers, but despite our diligent efforts we have not been able to acquire any such traces.

The DEAS03 and EECS03 traces are taken from the same systems as the DEAS and EECS traces described in earlier work [9], but are more recent and contain information not available in the earlier traces. The CAMPUS trace is the same trace described in detail in an earlier study [8], although we draw our samples from a longer subset of the trace. All three traces were collected with nfsdump [10].

Table 1 gives a summary of the average hourly operation counts and mixes for the workloads captured in the traces. These show that there are differences between these workloads, at least in terms of the operation mix. CAMPUS is dominated by reads, and more than 85% of the operations are either reads or writes. DEAS03 has proportionally fewer reads and writes and more meta-data requests (getattr, lookup, and access) than CAMPUS, but reads are still the most common operation. On EECS03, meta-data operations comprise the majority of the workload.

Earlier trace studies have shown that hourly operation counts are correlated with the time of day and day of week, and that much of the variance in hourly operation count is eliminated by using only the working hours [8]. Table 1 shows that this trend appears in our data as well. Since the "work-week" hours (9am-6pm, Monday through Friday) are both the busiest and most stable subset of the data, we focus on these hours for many of our analyses.

One aspect of these traces that has an impact on our research is that they have been anonymized, using the method described in earlier work [8]. During the anonymization, UIDs, GIDs, and host IP numbers are simply remapped to new values, so no information is lost about the relationship between these identifiers and other variables in the data. The anonymization method also preserves some types of information about file and directory names; for example, if two names share the same suffix, then the anonymized forms of these names will also share the same suffix. Unfortunately, some information about file names is lost. A survey of the file names in our own directories leads us to believe that capitalization, use of whitespace, and some forms of punctuation in file names may be useful attributes, but none of this information survives anonymization. As we will show in the remaining sections of this paper, the anonymized names provide enough information to build good models, but we believe that it may be possible to build even more accurate models from unanonymized data.

4 The Case for Attribute-Based Predictions

To explore the associations between the create-time attributes of a file and its longer-term properties, we begin by scanning our traces to extract both the initial
attributes of each file (such as those we observe in create calls) and the evolution of the file throughout the trace, so that we can record information about its eventual size, lifespan, and read/write ratio. From these observations, we are able to measure the statistical association between each attribute and each property. The stronger the association, the greater the ability to predict a property, given the attribute.

All Hours
Host      read            write           lookup          getattr         access
DEAS03    48.7% (50.9%)   15.7% (55.3%)   3.4% (161.6%)   29.2% (49.3%)   1.4% (119.5%)
EECS03    24.3% (73.8%)   12.3% (123.8%)  27.0% (69.5%)   3.2% (263.2%)   20.0% (67.7%)
CAMPUS    64.5% (48.2%)   21.3% (58.9%)   5.8% (44.4%)    2.3% (60.7%)    2.9% (51.4%)

Peak Hours (9:00am – 6:00pm Weekdays)
DEAS03    50.0% (24.3%)   16.8% (28.9%)   3.4% (29.3%)    26.6% (29.4%)   1.3% (44.8%)
EECS03    18.2% (63.5%)   12.3% (86.7%)   27.0% (33.6%)   3.0% (129.9%)   21.5% (39.8%)
CAMPUS    63.5% (8.5%)    22.3% (16.7%)   5.6% (8.1%)     2.4% (32.6%)    3.0% (10.8%)

Table 1: The average percentage of read, write, lookup, getattr, and access operations for a fourteen-day trace from each server. The averages are shown both for all hours during the trace and for the peak hours (9:00am – 6:00pm on weekdays). The coefficient of variation for each hourly average is given in parentheses.

Some of these associations are intuitive: files that have the suffix .gz tend to be large, files whose names contain the string lock tend to be zero-length and live for a short period of time, etc. Other associations are less obvious. Particular users and groups, for example, often have unique lifespan distributions for their files. We find that the mode of a file (i.e., whether the file is readable or writable) often serves as a fingerprint of the environment in which it was created, and can even expose certain idiosyncrasies of the users and their applications. The mode is often a surprisingly good indicator of how a file will be used, but not as one would expect: on the DEAS03 trace, for example, any file with a mode of 777 is likely to live for less than a second and contain zero bytes of data. It is somewhat nonintuitive that a file that is created with a mode that makes it readable and writable by any user on the system is actually never read or written by anyone. Most of these files are lock files (files that are used as a semaphore for interprocess communication; their existence usually indicates that a process desires exclusive access to a particular file).

In order to capture some of the information expressed by different file-naming conventions (for example, using suffixes to indicate the format of the contents of a file), we decompose file names into name components. Each of these components is treated as a separate attribute of the file. We have found it is effective to use a period ('.') to delimit the components. For example, the file name foo.bar would have two name components (foo and bar). To simplify our analysis, we limit a file name to three name components (first, middle, and last). Files with more than three components would have the remainder subsumed by the middle name. For example, the file foo.bar.gz.tmp would have a middle name of bar.gz. Filenames with fewer than three components will take on NULL component values. There may be other useful features within file names, but we are constrained by the information that remains after the anonymization process.

In the remainder of this section we use the chi-square test [16] (pages 687–693) to show that the association between a file's attributes and its properties is more than a coincidence. We provide statistical evidence that the association is significant and quantify the degree of associativity for each attribute.

4.1 Statistical Evidence of Association

We use a chi-square test (also known as a two-dimensional contingency table) to quantify the association between each attribute and property. The chi-square test of association serves two purposes. First, it provides statistical evidence that associations exist and quantifies the degree of associativity for each attribute. Second, it is one of the mechanisms we use to automatically construct a decision tree that uses the information we extract from the traces to predict the properties of files.

If there is no association, then the probability of a file having a given property is independent of the attribute values. For example, suppose that we find that 50% of the files we observe are write-only. If the write-only property is associated with the file name suffix, then this percentage will be different for different suffixes. For example, we may find that 95% of the .log files are write-only. If no association exists, then the expected percentage of write-only files with each extension will not differ from 50% in a statistically significant manner. The difficulty with such a test is distinguishing natural variation from a statistically significant difference; the chi-squared test is used to detect and quantify such differences.
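This expected-versus-observed comparison can be sketched in a few lines. The code below is our illustration, not the paper's implementation, and the sample counts echo the hypothetical .log example above:

```python
from collections import Counter

def chi_squared(observations):
    """Chi-squared statistic for (attribute_value, property_value) pairs,
    e.g. (name suffix, is_write_only). Zero means the observed counts
    exactly match the counts expected under independence."""
    n = len(observations)
    attr_totals = Counter(a for a, _ in observations)
    prop_totals = Counter(p for _, p in observations)
    cells = Counter(observations)
    stat = 0.0
    for a, a_count in attr_totals.items():
        for p, p_count in prop_totals.items():
            expected = a_count * p_count / n
            stat += (cells[(a, p)] - expected) ** 2 / expected
    return stat

# Hypothetical sample: .log files are mostly write-only, .gif files are not.
sample = ([("log", True)] * 95 + [("log", False)] * 5 +
          [("gif", True)] * 10 + [("gif", False)] * 90)
print(chi_squared(sample))  # large statistic -> strong association
```

In practice the statistic would then be converted to a p-value against the chi-squared distribution with the appropriate degrees of freedom; we omit that step here.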
The sum of squared differences between the expected and observed number of files is our chi-squared statistic, and we calculate this statistic for each combination of attribute and property. In statistical terms, we are trying to disprove the null hypothesis that file attributes are not associated with file properties. A chi-squared statistic of zero indicates that there is no association (i.e., the expected and observed values are the same), while the magnitude of a non-zero statistic indicates the degree of association. This value is then used to calculate a p-value, which estimates the probability that the difference between the expected and observed values is coincidental.

For all of our tests, we have a high chi-squared statistic, and the p-values are very close to zero. Therefore we may, with very high confidence, reject the null hypothesis and claim that attributes are associated with file properties.

The chi-squared test can also be used to rank the attributes by the degree of association. Figure 2 shows how the chi-squared values differ for the size and lifespan properties. There are two important points to take from this figure. First, the attribute association differs across properties for a given trace; for example, in CAMPUS the uid shows a relatively strong association with the lifespan, yet a weak association with the size. The second point is that the relative rankings differ across traces. For example, on CAMPUS the middle component of a file name has a strong association with lifespan and size, but the association is much weaker on DEAS03 and EECS03.

Figure 2: The relative strength of the correlation between the properties "lifetime (lftmd) is one second or shorter" and "size is zero" and several file attributes (as indicated by the chi-squared values) for one day of each trace. The chi-squared values are normalized relative to the attribute with the strongest association. The last, middle, and first attributes refer to components of the file name, as described in Section 5. (The three bar charts, one each for DEAS03 3/24/2003, EECS03 3/24/2003, and CAMPUS 10/22/2001, plot the attributes first, middle, last, uid, gid, and mode against normalized chi-squared values from 0 to 1; they are not reproduced here.)

Although we show only two properties in these graphs, similarly diverse associations exist for other properties (e.g., directory entry lifespan and read/write ratio). In Section 5 we show how these associations can be dynamically discovered and used to make predictions.

The chi-squared test described in this section is a one-way test for association. This test provides statistical evidence that individual attributes are associated with file properties. It does not, however, capture associations between subsets of the attributes and file properties. It also does not provide an easy way to understand exactly what those associations are. One can extend this methodology to use n-way chi-square tests, but the next section discusses a more efficient way for both capturing multi-way associations and extracting those associations efficiently.
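The name-component decomposition used throughout this section is simple enough to sketch. The helper below is our own illustration (the function name is invented); it splits a name on periods and folds any extra components into the middle component, as in the foo.bar.gz.tmp example above:

```python
def name_components(filename):
    """Split a file name into (first, middle, last) components on '.',
    folding extra components into the middle. Names with fewer than
    three components get None (NULL) placeholders."""
    parts = filename.split(".")
    # Leading-dot names like ".#pico" yield an empty first component.
    if len(parts) == 1:
        return (parts[0], None, None)
    if len(parts) == 2:
        return (parts[0], None, parts[1])
    return (parts[0], ".".join(parts[1:-1]), parts[-1])
```

Each of the three returned components would then be treated as a separate categorical attribute of the file.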
5 The ABLE Predictor

The results of the previous section establish that each of a file's attributes (file name, uid, gid, mode) is, to some extent, associated with its long-term properties (size, lifespan, and access pattern). This fact suggests that these associations can be used to make predictions about the properties of a file at creation time. The chi-squared results also give us hope that higher-order associations (i.e., an association between more than one attribute and a property) may exist, which could result in more accurate predictions.

To investigate the possibility of creating a predictive model from our data, we constructed an Attribute-Based Learning Environment (ABLE). ABLE is a learning environment for evaluating the predictive power of file attributes. The input to ABLE is a table of information about files whose attributes and properties we have already observed, and a list of properties that we wish to predict. The output is a statistical analysis of the sample, a chi-squared ranking of each file attribute relative to each property, and a collection of predictive models that can be used to make predictions about new files.

In this paper, we focus on three properties: the file size, the file access pattern (read-only or write-only), and the file lifespan. On UNIX file systems, there are two aspects of file lifespan that are interesting: the first is how long the underlying file container (usually implemented as an inode) will live, and the other is how long a particular name of a file will live (because each file may be linked from more than one name). We treat these cases separately and make predictions for each.

ABLE consists of three steps:

Step 1: Obtaining Training Data. Obtain a sample of files and for each file record its attributes (name, uid, gid, mode) and properties (size, lifespan, and access pattern).

Step 2: Constructing a Predictive Classifier. For each file property, we train a learning algorithm to classify each file in the training data according to that property. The result of this step is a set of predictive models that classifies each file in the training data and can be used to make predictions on newly created files.

Step 3: Validating the Model. Use the model to predict the properties of new files, and then check whether the predictions are accurate.

Each of these steps contains a number of interesting issues. For the first step, we must decide how to obtain representative samples. For the second, we must choose a learning algorithm. For the third, we must choose how to evaluate the success of the predictions. We may consider different types of errors to have different degrees of importance; for example, if the file system treats short-lived files in a special manner, then incorrectly predicting that a file will be short-lived may be worse than incorrectly predicting that a file will be long-lived.

In Section 6, for example, we show that by identifying small, short-lived files and hot directories, we can use predictions to optimize directory updates in a real file system.
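Taken together, the three steps above can be sketched with a toy ID3-style learner. The code below is our illustration only, not the ABLE implementation: the attribute values and training rows are invented, and a production inducer would add split-selection statistics (such as the chi-squared test) and pruning:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of binary labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build_tree(rows, labels, attrs):
    """Recursively induce an ID3-style tree. `rows` are dicts mapping
    attribute name -> value; `labels` are the binary property values."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    def gain(attr):  # information gain of splitting on `attr`
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder
    best = max(attrs, key=gain)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[best], []).append((row, label))
    node = {"attr": best, "children": {}}
    for value, members in groups.items():
        node["children"][value] = build_tree(
            [r for r, _ in members], [l for _, l in members],
            [a for a in attrs if a != best])
    return node

def predict(node, row, default=False):
    """Walk the tree; fall back to `default` for unseen attribute values."""
    while isinstance(node, dict):
        node = node["children"].get(row[node["attr"]], default)
    return node

# Step 1: a hypothetical training sample of (attributes, write-only?) pairs.
train = [({"last": "log", "mode": "600"}, True),
         ({"last": "log", "mode": "600"}, True),
         ({"last": "gif", "mode": "644"}, False),
         ({"last": "c",   "mode": "644"}, False)]
# Step 2: induce one tree per property (here, just write-only).
tree = build_tree([r for r, _ in train], [l for _, l in train],
                  ["last", "mode"])
# Step 3: predict the property of a newly created file.
print(predict(tree, {"last": "log", "mode": "600"}))  # prints True
```

Validation (Step 3) would repeat the final call over a held-out test day and count correct predictions, as the paper does with its Monday/Tuesday split.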
Table 2: ABLE training samples obtained from DEAS03. (Table not reproduced.)

Figure 3: Constructing a simple decision tree from the training data in Table 2. (Figure not reproduced.)

…all, even if the new file shared many other attributes with .hot or .cold files. To avoid this problem, ABLE instructs the ID3 algorithm to continue expanding the tree until all attributes are exhausted (or the data is perfectly classified), and then ABLE prunes the bottom leaves of the tree to eliminate branches that are potentially unnecessary or overly specific. This is one of many pruning methods commonly used to favor a smaller tree over a larger one [4] (pages 279–293), in the hope that a smaller tree generalizes better on future samples. Note that building the tree top-down and selecting the most strongly associated attributes first guarantees that only the least associated attributes will be pruned in this process.

5.3 Validating the Model

At this point, we have used our training data to induce a decision tree model that can classify the data. The result is a model that can be used to classify new files (i.e., predict their properties). For example, if a new file were to be created with mode 600 and name foo.log, the model will predict that the file will be write-only. For our simple example, we only have one rule: a file is write-only only if its mode is 600 and its last name component is log. In general, a rule is a conjunction of all attributes on a path to a positively classified leaf node.

For each of the binary predicates, we induce a decision tree from a sample of files seen during the peak hours (9am-6pm) on a Monday from the trace (10/22/2001 for CAMPUS and 3/24/2003 for EECS03 and DEAS03). We then make predictions about the files created during the peak hours on the following day. The decision to train on the peak hours of Monday and test on the peak hours of Tuesday is not completely arbitrary; as shown in Section 3, the peak hours are the most active hours of the day. The resulting sizes of the training and testing samples are approximately 40,000 files for DEAS03, 35,000 for CAMPUS, and 15,000 for EECS03.

For comparison purposes, we compare against a simple model named MODE that always predicts the mode of a property, which is defined as the value of the property that occurs most frequently in the training data. For example, if most of the files created on Monday were write-only, then the MODE predictor would predict that every file created on Tuesday would be write-only, without considering any of the file attributes. Because all our properties are binary, each prediction is either correct or incorrect, and the prediction accuracy is simply the ratio of correct predictions to the sample size.

Table 3: A comparison of the accuracy of the ABLE and MODE predictors for several properties for the three traces. MODE always predicts the value that occurred most frequently in the training sample, without considering any attributes of the new file. (Table not reproduced.)

Table 3 shows the prediction accuracies on Tuesday for each of DEAS03, EECS03, and CAMPUS. In nearly all cases, ABLE more accurately predicts the properties of files, and in some cases nearly doubles the accuracy relative to probability-based guessing (MODE). However, there are some cases, specifically on the CAMPUS trace, where the workload is so uniform that MODE does almost as well.

5.4 MABLE and NABLE

ABLE's decision trees successfully exploit the statistical association between file attributes and properties and can be used to produce accurate predictions about future file system activity. We were also curious about which attributes make the largest contribution. The chi-squared analysis in Section 4 established that many of the attributes had strong associations, but this is not enough to determine whether or not multi-way attribute associations would have much effect on prediction accuracy.

The easiest way to measure the effects of additional attributes is to compare the ABLE trees (induced using all available attributes) against a set of constrained trees (induced with a limited set of attributes).

If multi-way associations exist between the attributes, then we can empirically measure their effect by comparing prediction accuracies. To this end, we construct two new sets of constrained decision trees, and compare these against the ABLE (unconstrained) decision trees.
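The MODE baseline used for comparison in this section is trivial to implement. The sketch below is our illustration (the sample data are invented); it shows that MODE's accuracy is just the frequency of the majority property value in the test sample:

```python
from collections import Counter

def mode_predictor(train_labels):
    """MODE ignores attributes and always predicts the most frequent
    property value seen in the training sample."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda attributes: most_common

def accuracy(predict, test_set):
    """Fraction of correct predictions over (attributes, label) pairs."""
    hits = sum(predict(attrs) == label for attrs, label in test_set)
    return hits / len(test_set)

# Hypothetical Monday training labels: 70% of files were write-only.
monday_labels = [True] * 70 + [False] * 30
mode = mode_predictor(monday_labels)

# Hypothetical Tuesday test sample (attributes shown only for symmetry
# with an attribute-aware predictor; MODE never looks at them).
tuesday = [({"last": "log"}, True)] * 60 + [({"last": "c"}, False)] * 40
print(accuracy(mode, tuesday))  # prints 0.6
```

An attribute-aware predictor evaluated with the same `accuracy` helper makes the ABLE-versus-MODE comparison in Table 3 a like-for-like measurement.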
…clouded by transient multi-way associations that do not apply in the long run. Second, NABLE and MABLE offer predictions roughly equivalent to ABLE. This is somewhat surprising, particularly in the case of MABLE, because it means that we can make accurate predictions even if we do not consider file names at all.

Figure 4: Comparing the prediction accuracy of ABLE, NABLE, and MABLE for the properties size=0, write-only, and lifetime ≤ 1 second. Prediction accuracy is measured as percentage correct. (The bar chart, with accuracy from 0 to 80% on the y-axis and the properties size=0, wronly, and lftmd<=1s (direntry) on the x-axis, is not reproduced here.)

Given enough training data, ABLE always outperforms MABLE and NABLE. For the results presented in this paper, ABLE required an extra week of training to detect the false attribute associations, due in part to the small number of attributes. We anticipate that more training will be required for systems with larger attribute spaces, such as object-based storage with extended attributes [18] and non-UNIX file systems such as CIFS or NTFS [29]. Furthermore, irrelevant attributes may need to be pre-filtered before induction of the decision tree [6] to prevent over-fitting. The automation of ABLE's training policies, including attribute filtering, is an area for future work.
• NABLE predicts "write-mostly" if
  first=cache & last=gif [5742/94.0%]
• MABLE predicts "size=0" if
  mode=777 [4535/99.8%]
• ABLE predicts "deleted within 1 sec" if
  first=01eb & last=0004 & mode=777 & uid=18abe [1148/99.7%]

Figure 5: Example rules for DEAS03 discovered by NABLE, MABLE, and ABLE. The number of files that match the attributes and the observed probability that these files have the given property are shown on the right. For example, NABLE predicts that files whose name begins with cache and ends in .gif will be "write-mostly". This prediction is based on observations of 5742 files, 94.0% of which have the "write-only" property.

5.5 Properties of Our Models

In our experience, a model that predicts well for one day will continue to perform well for at least the next several days or weeks [9]. However, workloads evolve over time, and therefore our models will eventually decrease in accuracy. We are exploring ways to automatically detect when new models are necessary. Fortunately, building a new model is an inexpensive process (requiring approximately ten minutes of processing on a modest Pentium-4 to build a new model from scratch for the peak hours of the heaviest workloads in our collection), so one possible approach is simply to build new models at regular intervals, whether or not the current models have shown any sign of degradation.

In general our decision trees yield roughly a 150:1 ratio of files to rules. Rules can be easily inspected after the fact to determine interesting patterns of usage (which is how we discovered the associations originally). On DEAS03, for example, the 45K sample files induced a decision tree with only 300 rules (i.e., a decision tree with 300 leaves). This means that the resulting model requires only a few kilobytes to store.

6 Using the Predictions

Now that we have the ability to make predictions about the future properties of a file based on its attributes when it is created, the question remains what benefit we can reap from this foresight.

One type of application that we believe can benefit from our predictions is file cache management policy. When choosing a buffer to evict, it would be helpful to have an accurate prediction of whether or not that buffer would be accessed in the near future (or at all). For the DEAS03 workload, for example, we can identify write-only files with a high degree of accuracy, and we know that we can immediately evict the buffers created by writing these files. Similarly, in a disconnected environment, knowing which files are read-only can help select files to hoard.

Pre-fetching can also benefit from predictions; if we can identify files that are highly likely to be read sequentially from beginning to end (perhaps on a user-by-user basis), then we can begin pre-fetching blocks for that file as soon as a client opens it. If cache space is plentiful, it might make sense to do aggressive pre-fetching for every file opened for reading, but if cache space is at a premium, it is valuable to know which files will benefit the most from this treatment.

Our predictions may also be helpful in optimizing file layout – if we can predict how large a file will be, and what access patterns the file will experience, then we can pre-allocate space on the disk in order to optimally accommodate these properties (instead of adapting to these properties as they become evident). For example, yFS uses three different block allocation and layout policies for different kinds of files and migrates files from one policy to another as they grow or their access patterns change [32]. Given accurate predictions, we can begin with the correct policy instead of discovering it later.

Another application of ABLE is to guide adaptive as well as pro-active techniques – we can use its models to predict not only what the future holds for new files, but also for existing files. In this paper we focus primarily on the prediction of the properties of new files, because this is a capability we have not had before. Nevertheless, it is important to recognize that the ABLE models can be used for adaptation as well.

The rest of this section discusses the use of name-based hints to cluster active directory blocks and inodes into a designated "hot" area of the disk. By placing this hot area in high-speed media (e.g., NVRAM) or placing it in the middle of the disk, we should reduce the overall disk access time. We use as our evaluation metric the degree to which we induce a hot spot on the designated area of the file system. We discuss how to benchmark the resulting system, and measure its performance on our three traces.
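The buffer-eviction application described earlier in this section can be sketched as follows: a cache that, when it must evict, first sacrifices buffers belonging to files a predictor flags as write-only, and otherwise falls back to plain LRU. The predictor, file names, and cache structure here are all hypothetical simplifications.

```python
from collections import OrderedDict

class PredictiveCache:
    """Toy buffer cache: evict predicted write-only buffers before LRU."""

    def __init__(self, capacity, predict_write_only):
        self.capacity = capacity
        self.predict = predict_write_only       # file_id -> bool (hypothetical model)
        self.buffers = OrderedDict()            # file_id -> data, in LRU order

    def access(self, file_id, data):
        if file_id in self.buffers:
            self.buffers.move_to_end(file_id)   # refresh LRU position
        elif len(self.buffers) >= self.capacity:
            self._evict()
        self.buffers[file_id] = data

    def _evict(self):
        # Prefer a buffer we predict will never be read back.
        for fid in self.buffers:
            if self.predict(fid):
                del self.buffers[fid]
                return
        self.buffers.popitem(last=False)        # otherwise evict the LRU head

# Example: files whose (invented) names start with "log" are predicted
# write-only, so their buffers are evicted first despite being recent.
cache = PredictiveCache(2, lambda fid: fid.startswith("log"))
cache.access("log.1", b"...")
cache.access("paper.tex", b"...")
cache.access("notes.txt", b"...")   # evicts log.1, not paper.tex
print(sorted(cache.buffers))        # prints ['notes.txt', 'paper.tex']
```

A real implementation would operate on blocks rather than whole files and would consult the model once at create time, caching the verdict with the file's in-memory state.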
6.1 Benchmarking Attribute-Based Systems

One of the difficulties of measuring the utility of attribute-based hints in the context of real file systems is finding a suitable benchmark. Synthetic workload generators typically create files in a predictable and unrealistic manner – they make little or no attempt to use realistic file names or mimic the diverse behaviors of different users. If we train our models on data gathered while these benchmarks are running, then our predictions will probably be unrealistically accurate, but if we train on a workload that does not include the benchmarks, then our predictions for the files created by the benchmark will be uncharacteristically bad.

Our solution to this problem is to construct a benchmark directly from traces of the target workload, thereby ensuring that the associations between file names, modes, and uids during the trace will resemble those present in the actual workload. This leads immediately to a new problem – in order to replay the traces, we need a real file system on which to play them. The usual solution to this problem is to recreate the traced file system from a snapshot of its metadata taken at a known time, and then begin replaying from that time [28]. This method works well when snapshots are available, and when a suitable device is available on which to reconstruct. Unfortunately we have neither – there are no publicly-available snapshots of the systems from which the traces were taken, and even if there were, reconstructing them would require at least 500GB of disk space and many hours of set-up time per test.

To solve this problem, we have developed a new method of performing a snapshot-less trace replay that uses the trace itself to reconstruct the subset of the file system necessary to replay a given section of the trace. We call these sub-snapshots. In essence, our method is to replay the trace several times, inferring knowledge about the underlying file system by observing how it is used.

The first pass reconstructs as much as it can of the file system hierarchy, primarily by observing the parameters and responses of lookup, getattr, create, mkdir, rename, remove, and link calls. The idea of discovering the file system hierarchy by snooping NFS calls is not new and has been in widespread use since the technique was described by Blaze [3]. Unfortunately, as other researchers have noted, this method is imperfect – some of the information may be absent from the trace because of missed packets or because it is cached on the client during the trace period and thus never visible in the trace. To compensate for this missing data, we keep track of each file or directory that is accessed during the trace, but whose metadata we cannot infer. When the first pass is finished, we may either fill in the missing values with reasonable defaults or discard the incomplete items.

Because we are using attribute-based models, we cannot simply invent file attributes and hope that they will work. However, there is a danger that if we discard all the objects for which we have incomplete information, we may lose a significant portion of the workload. For the experiment described in this section, we use only name attributes. After examining the traces, we find that we are unable to determine names for fewer than 5% of the files mentioned in the workload (and typically far fewer). Therefore we believe that discarding these "anonymous files" does not alter the workload to an important degree.

Files or directories for which we cannot infer the parent are attached to the root directory, because from our own experiments we have found that this is the directory most likely to be cached on the client. For example, we rarely see lookups for /home/username, because home directories are frequently accessed and rarely invalidated.

The output of the first pass is a table containing the pathname of each file and directory observed in the trace, along with a unique identifier for each object and the size, mode, and other relevant information necessary to reconstruct the object. The purpose of the new identifier is to provide a convenient substitute for the file handle that is independent of the actual implementation of the file system. (File handles usually encode the mount point and inode numbers, and we cannot ensure that we will get the same values when we reconstruct the file system.)

The second pass through the trace replaces all of the file handles in the trace with the unique identifiers created in the first pass, and removes references to files for which no information could be inferred.

Based on the table created after the first pass, we then create a file system that matches the rewritten trace, and replay the new trace on that file system. The result is both realistic and repeatable.

Using this method, we constructed several sub-snapshots for each workload. A typical hour of activity on these systems accesses files containing only five to ten GB of data (although there are hours when many directories are scanned, resulting in enormous and unwieldy sub-snapshots). One of the challenges with DEAS03 and EECS03 is that there are apparently some jobs that periodically scan large parts of the directory hierarchy, checking the modification time of each file. Since most of these files are never actually read or written, we could modify our sub-snapshot builder to recognize this and treat these files differently (only creating a short or empty file, instead of a file the same size as the
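The two-pass rewriting described above can be sketched in a few lines. The trace format and field names here are invented for illustration; real NFS traces carry far more state (modes, sizes, directory links), and the first pass would also record everything needed to recreate each object.

```python
# A toy trace: each record is one observed NFS call (fields are hypothetical).
trace = [
    {"op": "lookup", "dir_fh": "fh01", "name": "home",  "fh": "fh02"},
    {"op": "create", "dir_fh": "fh02", "name": "a.txt", "fh": "fh03"},
    {"op": "write",  "fh": "fh03", "bytes": 4096},
    {"op": "read",   "fh": "fh99", "bytes": 512},   # handle never resolved
]

# Pass 1: infer a stable identifier and pathname for each file handle by
# watching the calls that reveal namespace structure.
table = {"fh01": {"id": 0, "path": "/"}}            # assume the root is known
next_id = 1
for rec in trace:
    if rec["op"] in ("lookup", "create") and rec["fh"] not in table:
        # Unknown parents are attached to the root, as described above.
        parent = table.get(rec["dir_fh"], table["fh01"])
        path = parent["path"].rstrip("/") + "/" + rec["name"]
        table[rec["fh"]] = {"id": next_id, "path": path}
        next_id += 1

# Pass 2: replace file handles with implementation-independent ids and
# drop records whose handle could never be resolved.
rewritten = []
for rec in trace:
    if rec["fh"] in table:
        new = dict(rec)
        new["fh"] = table[rec["fh"]]["id"]
        rewritten.append(new)

print([r["fh"] for r in rewritten])   # prints [1, 2, 2]
```

The resulting table is enough both to create the sub-snapshot file system and to replay the rewritten trace against it, independent of the inode numbers the reconstructed file system happens to assign.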
original). This would permit us to create sub-snapshots for a much larger fraction of the underlying file system.

6.2 Increasing Locality of Reference

As an example application, we explore the use of attribute-based hints to control the locality of block reference by anticipating which blocks are likely to be hot and grouping them in the same cylinder.

We use two methods to identify hot data blocks. The first method, which we call HotName, automatically classifies as hot any file that we predict will be short-lived and/or zero-length. For this type of file, the overhead of creating and maintaining the inode and name of the file (i.e., the directory entry for the file) can be a large fraction of the cost incurred by the file, and therefore there may be benefit to reducing this overhead. The second method, which we call HotDir, predicts which directories are most likely to contain files that have the HotName property. Since these directories are where the names for the HotName files will be entered, there may be benefit from identifying them as well.

The model that we use for HotDir is constructed via a method similar to ABLE, but unfortunately in our prototype requires some external logic because ABLE is focused on files and does not currently gather as much information about directories. In general, the HotDir rules are that directories identified as home directories, mail spool directories, and directories named Cache are classified as hot directories. (ABLE is capable of identifying the mail and Cache directories as interesting, but does not currently have an "is-home-directory" attribute.)

To test the effect of HotDir and HotName, we have modified the FreeBSD implementation of FFS so that it uses a simplified predictor (similar in nature to the ABLE predictor, but employing only name attributes, and re-coded to live in the kernel environment) to predict whether each new directory has the HotDir property and whether each new file has the HotName property. If so, it attempts to allocate blocks for that file or directory in a designated area of the disk. Our goal is to measure the increase in the number of accesses to this area of the disk when we use policies guided by HotDir and HotName.

We use two systems as our testbed. Both have a 1 GHz Pentium III processor, 1 GB of RAM, and run FreeBSD 4.8p3. Our experiments use the FreeBSD implementation of FFS with 16KB blocks and soft-updates enabled [12]. We have instrumented the device driver for the disk so that it keeps a count of how many reads and writes are done on each 16KB logical disk block.

Heuristic          Ops      Reads    Writes
DEAS03
  Perfect          26.17%   0.85%    42.28%
  HotDir            0.57%   0.22%     0.76%
  HotFile           0.59%   0.00%     0.95%
  HotDir+HotFile    1.10%   0.22%     1.60%
EECS03
  Perfect          23.89%   8.96%    41.61%
  HotDir            3.09%   1.11%     4.61%
  HotFile           2.82%   0.00%     5.00%
  HotDir+HotFile    5.95%   1.15%     9.65%
CAMPUS
  Perfect           3.90%   0.76%    11.28%
  HotDir            1.43%   0.58%     3.36%
  HotFile           1.13%   0.00%     3.70%
  HotDir+HotFile    2.60%   0.57%     7.23%

Table 4: Average percentage of the total ops, reads, and writes that fall in the 4MB target region of the disk for each of the heuristics on DEAS03, EECS03, and CAMPUS. The "Perfect" heuristic shows the maximum percentage attainable by an algorithm with perfect knowledge. The working set for these runs varies from 5-10GB.

6.3 Results

To test our heuristics, we ran a series of one-hour trace replays for the hours noon-5pm for several days on each of our traces. The models are trained on a Monday (3/24/03 for DEAS03 and EECS03, 10/22/01 for CAMPUS), and the replays are constructed from the following Tuesday through Thursday. Each hour-long replay begins with 15 minutes to warm the cache. Then the block counters are reset, and the test begins in earnest and runs for 45 minutes of replay time.

We designate a 4MB region as the target area for hot objects. Our evaluation examines the distribution of actual accesses to the disk and compares the percentage that go to the target area to the theoretical maximum number of accesses that would go to the hottest 4MB region given perfect knowledge (i.e., if the hottest 256 16KB blocks on the disk were allocated in the target region).

As shown in Table 4, both heuristics improve locality compared to the default layout policy, and using both heuristics is an improvement over using either one alone. Write locality is increased more than read locality; this is not surprising because directory contents are read-cached. Using both HotDir and HotName, we manage to increase the number of accesses to two-thirds of that of
the hottest possible region on CAMPUS, and on EECS03 nearly 6% of all the disk accesses during the trace are now confined to the target area. These percentages may seem small, but keep in mind that we are focusing only on small files and directories, and normal file traffic is the dominant cause of disk accesses in these workloads.

7 Conclusions

In addition to caching and on-disk layout optimization, we envision a much larger class of applications that will benefit from dynamic policy selection. Attribute-based classification of system failures and break-ins (or anomaly detection) is a natural adjunct to this work (e.g., "has this file been compromised?"). Moreover, through the same clustering techniques implemented by our decision trees, we feel that semantic clustering can be useful for locating information (e.g., "are these files related?"). Both of these are areas of future work.