Using Distributed OLTP Technology in a High Performance Storage System
Terrill W. Tyler
IBM Government Systems
Houston, Texas
David S. Fisher
Lawrence Livermore National Laboratory
Livermore, California
Need for ACID properties in distributed mass storage systems

A distributed mass storage system is another example of an application that requires ACID properties in managing shared persistent information. While the implementation of a distributed mass storage system does not blaze new trails in the application of OLTP concepts, it does introduce certain distinguishing traits in contrast to existing commercial OLTP applications.

Distributed mass storage systems must maintain persistent distributed database information, referred to in this paper as metadata, describing the files they manage as well as lower-level storage elements such as cartridges, disks, storage segments, physical volumes, virtual volumes, and file segments. Regardless of whether this metadata is physically distributed among numerous nodes or is centralized, the fact that multiple processes and/or threads are independently updating and contending for this metadata on behalf of multiple, concurrent end-user requests dictates that ACID properties be provided.

An even more basic need for these ACID properties can be seen when analyzing a single end-user request. For example, consider an application that creates a new file, writes a single block of data, and closes the file. A generic distributed mass storage system modeled after the IEEE Reference Model for Open Storage Systems Interconnection would likely involve several types of general metadata operations spread across its servers.

The first two operations associated with creating the file involve the Name Server and Bitfile Server. To maintain the integrity of the mass storage system database, these operations should be transactional: either both succeed together or neither takes place. A situation in which the Name Server's file creation operation succeeds, but the Bitfile Server's file creation operation fails without undoing the Name Server's operation, should not be permitted to happen. This would cause file entries to build up in the Name Server that had no corresponding file entry in the Bitfile Server.

This is just one example from the above scenario where multiple metadata operations between different servers/processes must be treated as a single, aggregate, transactional operation. If ACID properties are not enabled in the system, individual servers may be able to keep their respective databases correct, but, as a whole, the system metadata eventually will become inconsistent.

Use of commercially available OLTP products

Developing the underlying services that provide ACID properties for metadata updates, customized for a distributed mass storage system, would be costly. Not only must the initial development of this software be considered, but the additional maintenance costs throughout the product's full life-cycle must be factored in as well. Significant cost avoidance can be realized by building storage systems around commercially available OLTP products. It is the assertion of this paper that these commercial off-the-shelf products can be used successfully in these applications.
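To make the requirement concrete, the fragment below sketches the shape such an aggregate operation takes when written with a transactional construct. It is illustrative only: ns_CreateEntry() and bfs_CreateBitfile() are hypothetical stand-ins for the Name Server and Bitfile Server create operations, and the transaction/onCommit/onAbort syntax anticipates the Encina Tran-C constructs described later in this paper [8].

    /* Hypothetical sketch: the two file-creation operations as one
     * atomic unit of work. Neither function name is a real HPSS API. */
    void create_file(const char *name)
    {
        transaction {
            long nsId = ns_CreateEntry(name);      /* Name Server entry    */
            long bfId = bfs_CreateBitfile(nsId);   /* Bitfile Server entry */
            if (nsId < 0 || bfId < 0)
                abort();     /* reverses the changes in BOTH servers */
        } onCommit {
            ;   /* both entries now exist */
        } onAbort {
            ;   /* neither exists: no orphaned Name Server entries */
        }
    }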
Case study: the High Performance Storage System

The remainder of this paper presents a case study in the design and implementation of a mass storage system employing an OLTP. The mass storage system discussed is HPSS, the High Performance Storage System [5], which uses Encina, a distributed OLTP system provided by Transarc Corporation [6].

Background

HPSS is a major development project within the National Storage Laboratory (NSL). The primary development partners for HPSS are Lawrence Livermore, Los Alamos, Oak Ridge and Sandia National Laboratories, and IBM Government Systems. Other partners include Cornell, NASA Lewis, and NASA Langley Research Centers.

HPSS provides a scalable, parallel, high-performance, hierarchical storage system for highly parallel computers as well as traditional supercomputers and workstation clusters. A key architectural requirement is the scalability of data transfer rates, storage capacity, and the number and size of file objects. HPSS is a general-purpose storage system that has been developed to scale for order-of-magnitude performance improvements.

To meet the high-end storage system and data management requirements, HPSS is designed to use both network-connected and directly connected storage devices and to employ parallel I/O techniques, including software striping, to achieve high transfer rates. The design is based on the IEEE Reference Model for Open Storage Systems Interconnection (Project 1244) [7].

Encina provides the distributed OLTP services used by HPSS as described below. The OSF Distributed Computing Environment (DCE) is employed by Encina, and to a lesser extent directly by the HPSS servers, to provide supporting distributed computing services. Of the wide variety of services offered by DCE, HPSS and Encina make use of server multithreading, remote procedure calls (RPC), server interface registration and identification, global unique object naming, client/server authentication and authorization, and other security services.

Use of Encina in HPSS

The Encina distributed OLTP system is used in HPSS to provide several essential support services:

• Distributed transaction services across multiple servers
• Nested transactions across and within servers
• A transactional record-oriented file system (the Structured File Server) for the storage of system metadata
• Transaction call-backs

Distributed transaction services across multiple servers

Encina provides, as its basic service, a distributed OLTP system that is used by HPSS servers to coordinate changes to metadata across multiple, independently executing, multithreaded servers. Each server provides one or more application programming interfaces (APIs), defining functions that may be invoked through remote procedure calls (RPCs) from the client to the server. Many of these functions are transactional, meaning that they must be performed in the scope of a transaction created by the client before the function is invoked. Programs that initiate transactions in HPSS include the HPSS user API library and several of the servers. Encina extends and coordinates transactions across RPCs so that the server boundaries become transparent with respect to the transactions.

In these transactional remote procedure calls (TRPCs), changes that are made to HPSS metadata as the procedure executes take on transactional characteristics. Records are locked prior to modification so that they may not be accessed or modified by other transactions until the locking transaction completes (isolation property). Log files are used to record the state of modified records before and after modification (durability property). If the transaction commits, the metadata changes made by the procedure become permanent. If the transaction aborts, the metadata changes are reversed as if they never occurred. The consistency property of transactions guarantees that all of the metadata changes made in the transaction either commit together or abort together. While the transaction is in progress, the new values of changed metadata records are not available outside of the transaction. This aspect of the isolation property allows HPSS servers to make changes to system metadata records inside a transaction family without having to protect the changes from examination by other threads in the servers.

Using Encina allows HPSS to be partitioned into logically defined servers and allows the extension of transactional semantics to metadata changes that take place while performing functions in one or more of these servers. Functions such as file creation, file cataloging, file reading and writing, and file deletion typically cause a transaction to be extended across two or three servers. Transaction boundaries are chosen so that all of the metadata changes associated with one of these functions are contained in a single transaction, regardless of the number of servers involved and the number of metadata records changed. When the transaction commits, all of the records appear to change simultaneously (atomicity property). If the transaction aborts, all of the original values of the records appear to be restored at once.

Transactions protect the integrity of system metadata during a failure of the system. At the time of failure, each extant transaction either will have committed or not; there is no "halfway" point in committing a transaction. If the transaction commits and the system then immediately fails, the committed metadata changes will be restored during the process of restarting the system (durability property). If the transaction has not committed at the time of failure, it is aborted during system restart, and all changes made to metadata are reversed. In either case, the integrity of the system metadata is ensured.

Encina provides a C language interface to the transaction system, Tran-C, which is used extensively in HPSS servers. Tran-C, which is implemented as a family of C macros and functions, provides easily used language constructs for creating, defining, and controlling transactions [8]. HPSS also makes use of a number of Encina library functions to set up transaction call-backs and control certain transaction details. Using Tran-C and the TRPC mechanism, a thread of control can create a transaction and pass it to one or more remote servers as if the remote servers were unified in a single process.
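As a rough illustration of this programming model, the fragment below sketches a Tran-C transaction that spans TRPCs into two servers. The stubs ss_CreateSegment() and ns_Insert() and the type file_attrs_t are hypothetical, and the DCE/Encina initialization a real program requires is omitted.

    /* Tran-C sketch (hypothetical stubs): one transaction spanning
     * TRPCs to two servers. An abort() anywhere inside the block
     * reverses the metadata changes made in every server touched. */
    int create_and_catalog(file_attrs_t *attrs, const char *path)
    {
        int error = 0;
        transaction {
            error = ss_CreateSegment(attrs);    /* TRPC: storage server */
            if (!error)
                error = ns_Insert(path, attrs); /* TRPC: name server    */
            if (error)
                abort();    /* both servers roll back together */
        } onCommit {
            ;   /* records in all servers appear to change at once */
        } onAbort {
            ;   /* Encina restored every record's original value */
        }
        return (error);
    }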
Nested transactions in HPSS

Encina provides a nested transaction facility that is used extensively in HPSS servers. In any given thread of control (which via TRPC may extend across servers), the first invocation of a Tran-C transaction statement causes a "top-level" transaction to be created. This transaction becomes the "top ancestor" transaction of a possible family of nested transactions. Any subsequent transaction statement nested in the top ancestor transaction, either in the server that created the top ancestor or in a called server, creates a subtransaction that becomes part of the transaction family. Subtransactions, in turn, may create their own subtransactions. If a subtransaction aborts, all of its descendant transactions are also aborted, but not its ancestors. When a subtransaction commits, any metadata changes it makes become permanent if and when all of its ancestors, up to and including the top ancestor transaction, commit.

In HPSS we isolate groups of system metadata changes from one another by creating subtransactions in the server procedures that make the changes. If an error occurs while making a metadata change, the subtransaction is aborted, reversing the change, and an error is returned as the procedure outcome. Other metadata changes, made in other procedures as part of the HPSS operation being performed, are not affected. The decision to abort those changes can be made by the caller of the failing procedure.

Without subtransactions, an abort after an error in a called server procedure would abort the caller's transaction. All metadata changes accumulated in the transaction would be lost. This is undesirable because the caller may be able to recover from the called procedure's error by calling a different procedure or server, or by taking some other corrective action. However, if the called procedure creates a subtransaction in which it performs its metadata changes, it can abort the subtransaction, restoring the metadata to its original state, and return an error code to the caller via the TRPC. The metadata changes made in the called procedure abort, but the changes made in the client remain intact. The client may then recover from the error or may abort its own operation as required.

As an example of the use of subtransactions in HPSS, consider the case of writing an HPSS file (Figure 1). The bitfile server creates a top ancestor transaction and calls a storage server, via a TRPC, to create storage space for the file. The storage server allocates space and changes its metadata in a subtransaction. When the TRPC returns to the bitfile server, the bitfile server links the new storage space into the file in the top ancestor transaction.

If an error occurs in the storage server, it aborts its subtransaction, reversing its metadata changes, if any, and returns an error code to the bitfile server. The bitfile server may then either recover from the error by allocating space in another storage server, or may abort its own metadata changes, and any committed subtransactions, and return an error to the caller of the file-create operation. In each of these outcomes, the integrity of the HPSS metadata database is maintained.

Figure 1. An example of using subtransactions in HPSS servers to isolate error effects. The Bitfile Server creates transaction T1 and calls the Storage Server to create storage space. The Storage Server creates subtransaction T2 and creates the storage space. If an error occurs while allocating space, the Storage Server aborts T2 and an error is returned. T1 is not aborted and may recover from the error. If T1 subsequently fails, T2 is aborted regardless of its original outcome.
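Only fragments of the figure's pseudocode survive in this copy, so the listing below is a reconstruction of its general shape from those fragments (CreateFileRec, ModifyRec2, recover, abort) and the caption, not the figure's exact text.

    /* Storage Server: performs its metadata change in subtransaction T2 */
    int ss_CreateSpace(void)
    {
        int error = 0;
        transaction {                  /* T2, nested under the caller's T1 */
            error = CreateFileRec();   /* allocate space, write record */
            if (error)
                abort();               /* undoes only T2's changes */
        }
        return (error);
    }

    /* Bitfile Server: creates top ancestor T1, calls the Storage Server */
    int bfs_WriteFile(void)
    {
        int error = 0;
        transaction {                  /* T1, the top ancestor */
            error = ss_CreateSpace();  /* TRPC; runs T2 remotely */
            if (error)
                error = recover();     /* e.g., try another server */
            if (error)
                abort();               /* aborts T1, and with it T2 */
            ModifyRec2();              /* link new space into the file */
        }
        return (error);
    }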
Structured File Server

HPSS stores its system metadata, the information needed by each server to describe the storage objects it provides, in SFS, the Encina Structured File Server. SFS provides a record-oriented storage facility for managing large numbers of records and accessing those records using one or more keys. Access to information stored in SFS records can be either transactional or nontransactional. Transactional access exhibits the ACID properties described earlier.

SFS can associate locks with the records in its files and binds these locks to transactions. The locks segregate access to the records by transaction, providing the isolation property of transactional access to HPSS metadata files. SFS logs changes made to records by transactions in a log that is used to restore the records to their original state if a transaction aborts, and to assure permanence of changes that are committed.

Each transactional HPSS server maintains one or more system metadata files in SFS. Records in HPSS metadata files are indexed using a primary key and, in many cases, one or more secondary keys. In general, primary keys are unique within HPSS metadata files, whereas secondary keys may or may not be unique.

The objects defined by HPSS servers are usually connected to one another within a server, and across servers, via primary and secondary keys embedded in the metadata records. Keys are placed in records so that the HPSS database taken as a whole may be traversed from either the top down or the bottom up. The system metadata architecture is optimized for top-down traversal. For example, each storage segment record maintained by a storage server is keyed by a storage segment key. The record has embedded in it the key of the virtual volume from which the storage segment is allocated. This key is a secondary key for the storage segment record, while the storage segment key is the primary key. Several storage segments may be allocated in the same virtual volume, so the secondary virtual volume keys will be shared by several storage segment records and will not be unique in the storage segment metadata file. Given a storage segment key, the associated virtual volume can be located immediately by searching the storage segment metadata file with the unique key. Only one record will be returned, and it will contain the virtual volume key. This is top-down traversal of the database.

Bitfiles are linked to storage segments (top-down), and storage segments are linked to bitfiles (bottom-up). Name server entries are forward-linked to bitfiles, and bitfiles are reverse-linked to name server entries. Top-down traversal of the database is used to locate file metadata information to carry out ordinary client requests (create/read/write/delete). Bottom-up traversal of the database is used to carry out certain maintenance activities, for example, locating all of the files stored entirely or partially on a failed physical volume.

Transaction call-backs

Encina provides a facility in which a process may arrange for a procedure to be called when a selected event occurs during the processing of a transaction. These events include transaction preparation, transaction resolution, transaction abort, and others, and call-backs can be established for top-level or nested transactions. The procedure call-back is made by an Encina library routine, in a thread managed by Encina, when the requested event occurs. From the point of view of the process, the procedure call happens asynchronously.

HPSS uses the call-back mechanism extensively to coordinate transactions with various maintenance activities in the servers. For example, several servers use call-backs to maintain a pool of open file descriptors (OFDs) for the SFS metadata files. An OFD is a file access handle provided by SFS. When an OFD for a metadata file is needed by the server, it is taken from a free pool and used in a transaction, which causes the OFD to be associated with the transaction. When the OFD is taken from the pool, an HPSS library routine sets up a call-back to a server procedure that returns the OFD to the free pool. In this case, the call-back occurs when the top ancestor transaction resolves (either commits or aborts).

In the HPSS disk storage server, call-backs are used to implement an in-memory cache of active metadata records. In many cases, this cache allows the server to reference metadata without taking the time to read it from the metadata file. Records are locked when they are referenced, using the identifier of the top ancestor transaction as the key. The cache entry remains locked to the transaction until the transaction resolves. Subtransactions may refer to the cache entry once it is locked by using the same top ancestor transaction ID. Other transactions and subtransactions are blocked from using the entry until the locking transaction resolves.
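The OFD-pool pattern might look roughly like the following. hpss_SetCallback() is a hypothetical stand-in for the Encina call-back registration routine the text refers to; the pool functions and types are likewise illustrative.

    /* Sketch of the OFD-pool call-back pattern (all names illustrative). */
    typedef struct ofd ofd_t;            /* SFS open file descriptor   */
    typedef long tid_t;                  /* transaction identifier     */

    extern ofd_t *ofd_pool_get(void);    /* take an OFD from free pool */
    extern void   ofd_pool_put(ofd_t *); /* return an OFD to the pool  */

    /* invoked by Encina, asynchronously, when the top ancestor resolves */
    static void ofd_release_callback(void *arg)
    {
        ofd_pool_put((ofd_t *) arg);
    }

    ofd_t *ofd_acquire(tid_t top_ancestor)
    {
        ofd_t *ofd = ofd_pool_get();
        /* arrange for the OFD to go back to the pool when the
         * transaction either commits or aborts */
        hpss_SetCallback(top_ancestor, ofd_release_callback, ofd);
        return ofd;
    }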
Transactions vs. storage media

If a write transaction aborts after user data has already been transferred to the medium, the data on the medium cannot be unwritten; instead, the volume is marked unwritable. Later, a system utility will find this volume, locate its actual next write address, and update the HPSS system metadata accordingly. The steps of marking the volume unwritable and correcting the volume's Next-Address-To-Write are done in top-level transactions, isolated from any other transaction that may be in effect.

Note that this "transactions vs. storage media" problem is not a serious one for HPSS because of two factors:

• Aborts in user data write operations are rare; metadata changes will usually commit even if the write operation ends in error
• Storage media maintenance activities, such as tape repacking and reclaiming, are always going on

Dealing with long-duration transactions

In order to make transactional changes to the HPSS metadata, the HPSS servers use SFS to create locks on records that are being changed. The locks implement the isolation property of Encina transactions and are maintained on a per-metadata-record basis by SFS. If a transaction locks a metadata record with the intent of writing the record (a write-lock), all other transactions are blocked from reading or writing the record. They may not read the record unless they are willing to read it without a lock, which allows them to read uncommitted data if the record has been modified by an uncommitted transaction.

In many typical OLTP applications, transactions are of short duration. In a mass storage system, some transactions may be very short while others may be very long in duration. Short transactions include operations that simply perform changes to the system metadata (for example, set-attributes functions), while long-duration transactions are those that perform I/O to the archive.

Holding locks on HPSS metadata for long periods of time while long-duration I/O transactions are in effect creates two potential problems for HPSS. The first is blocking other transactions' access to metadata records. While the isolation property of transactions is vital to maintaining the consistency of the HPSS database, locks held on metadata records by long-duration I/O transactions could cause other transactions to block for long periods of time while trying to perform non-I/O operations such as allocating storage space.

HPSS resolves this problem by imposing restrictions on the way storage objects are used and by reserving objects to active I/O operations. For example, files must be opened through the bitfile server prior to requesting a read or write operation. When I/O begins on the file, the file is reserved to the client until the I/O completes. If another client attempts an I/O operation, it is blocked until the first client's operation finishes and the transactions created during the I/O operation have resolved.

In the storage server, the session construct is used to group I/O operations together and reserve system resources. The bitfile server opens a session with the storage server each time it performs a file open operation. Initially, the session has no system resources assigned to it. When the first I/O operation on the file is processed by the bitfile server, the associated storage server assigns to the session the corresponding storage segments, virtual volumes, and physical volumes, if they are not already assigned to other sessions. Locks can then be taken on these objects during the I/O operation without a blocking hazard because the objects have first been reserved to the session. If any of the resources needed by an I/O operation are already reserved in another session, the I/O operation is blocked until the session holding the resources completes. This prevents I/O transactions from blocking on metadata locks and prevents deadlocks that would occur if two or more transactions competed for shared metadata records.

The second problem in dealing with long-duration I/O transactions is the issue of dealing with failed transactions after server crashes, host system failures, power failures, and the like. For reasons such as these, a transaction will occasionally fail to complete. The Structured File Server, which acts as the transaction coordinator in HPSS, maintains the integrity of the HPSS database by aborting the transaction. This is done by recording a time of last use in each of the Open File Descriptors open on HPSS metadata files. If the time since an OFD was last used exceeds a configurable limit set in SFS (the SFS OFD idle time-out), and another transaction attempts to take a lock on a record locked by the expired OFD, the expired transaction is aborted and the expired OFD is closed. This feature protects the HPSS database from the failures noted above and is an important reason for using an OLTP in a mass storage system.

If the transactions in the HPSS database could be characterized as never exceeding some upper bound in duration, the SFS idle time-out value could be set to that upper bound. However, when doing I/O operations on files of arbitrary length, as HPSS is designed to do, an upper bound cannot be established. Small values of the SFS idle time-out would effectively limit the size of files that could be written into HPSS. Large values would have an undesirable effect on recovering HPSS after a failure: one would either have to wait a long time for the SFS idle time-outs to expire on the transactions in effect at the time of the failure, or reinitialize SFS (which increases the time to reinitialize the system). Furthermore, for whatever practical value the SFS idle time-out is set to, an I/O operation that takes longer can be imagined. Note that time spent waiting for storage volumes to be mounted in a busy system may be spent inside the context of a transaction, further increasing the likelihood of an aborted transaction.

HPSS resolves this situation by employing a technique suggested by developers at Transarc. When OFDs are assigned to a transaction, a background thread in the server periodically performs a benign SFS operation using the OFD. These operations take very little time and do not change the state of locks on metadata records, but they do update the time of last reference in the OFD. The period of this "keep-alive" operation is set to a value that is most of the OFD idle time-out, but allows a margin of safety. As long as the server is functioning, its transactions and OFDs stay valid, and I/O operations can take as much time as they need to complete. Locks on system metadata remain valid, competing accesses to the locked records remain blocked, and the long-duration I/O operations complete normally.
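A minimal sketch of such a keep-alive thread is shown below, assuming a POSIX-threads server. sfs_TouchOfd() stands in for any benign SFS operation that updates an OFD's time of last use; the OFD table and all names are illustrative.

    /* Keep-alive thread sketch (names illustrative). */
    #include <pthread.h>
    #include <unistd.h>

    #define OFD_IDLE_TIMEOUT 300                        /* set in SFS (s) */
    #define KEEPALIVE_PERIOD (OFD_IDLE_TIMEOUT * 3/4)   /* safety margin  */

    typedef struct ofd ofd_t;
    extern ofd_t *active_ofds[];           /* OFDs assigned to transactions */
    extern int    active_ofd_count;
    extern int    sfs_TouchOfd(ofd_t *);   /* hypothetical benign SFS op */

    static void *keepalive_thread(void *unused)
    {
        (void) unused;
        for (;;) {
            sleep(KEEPALIVE_PERIOD);
            /* touch each active OFD so SFS restarts its idle timer;
             * no metadata locks change state */
            for (int i = 0; i < active_ofd_count; i++)
                (void) sfs_TouchOfd(active_ofds[i]);
        }
        return NULL;
    }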
Choosing transaction boundaries

In a few instances during the implementation of HPSS, we located the boundaries of transactions in places different from those described in the system design. Top-level transactions were used in some places that were originally implemented as subtransactions. In other cases, operations that seem logically to be a single transaction were implemented as two transactions.

A good example of this appears in the bitfile server. There are two steps in the loop that processes user write functions: in the first step, storage space is allocated; in the second step, the space is written with the user data. However, there are a few rare errors that can occur in the second step that make it desirable to discard the space allocated in the first step. The two steps logically form a single transaction, but in the implementation of the function, two transactions are used. If the function could be performed in one transaction, space allocated in the first step could be deallocated by simply aborting the transaction.

However, since the second step is sometimes a long-duration operation, a single transaction that includes the storage-allocation step would hold a storage map locked during the entire write operation. In the HPSS tape storage server, this is a tolerable situation because storage maps are kept in a busy state during tape writes anyway: when a tape is being written, no other storage space can be allocated on the tape until the write operation completes.

In the HPSS disk storage server, however, holding the storage map busy during a long disk write is undesirable because storage space for disk segments is allocated in finite-sized blocks, rather than in an open-ended fashion as for tape. Disk storage maps are therefore not kept in a busy state while the segment is being written. We want to allocate the disk space, modify the disk storage map, and commit those modifications as soon as possible so that other storage segments may be allocated from the map.

To resolve this problem, the HPSS bitfile server allocates storage segments in a top-level transaction, writes a log record with the segment-identification information, and commits the transaction. Then it starts a second top-level transaction, writes the user data, links the storage segment to the file metadata, and deletes the log entry. In the same transaction, the storage server updates its metadata with information about the written length of the segment. The first transaction creates the storage segment and releases the storage map; the second writes the user data, updates the storage-segment metadata, and attaches the storage segment to the file metadata. If an error occurs during the file-write operation, we delete the storage segment in a separate transaction. We use two transactions where we originally expected to use one, so that locks on metadata objects are not held for long periods of time, blocking the use of the objects by other users.

In another case, a deadlock that occurred when two or more HPSS clients were reading files located on the same virtual volume was broken by changing a subtransaction into a top-level transaction.

To carry out a file-read operation, the bitfile server creates a top-level transaction in which the file statistics changes are made. When two clients accessed two files located on a shared virtual volume, a deadlock occurred. One client would mount and read one volume, and in so doing would cause subtransactions to be started that made minor metadata changes associated with mounting the volume (time last mounted, number of mounts). The other client would do likewise with a second volume. Then each client would attempt to mount and read the volume previously read by the other. Since each client held a lock on the volume metadata needed by the other, in an uncommitted subtransaction, neither could proceed.

One solution to this deadlock is not to make the metadata changes associated with mounting the volumes. Another is to make the changes in a top-level transaction. We chose the second route, recognizing that it is appropriate to record the metadata changes associated with mounting the volume regardless of the outcome of the read operation. The physical act of mounting the volume cannot be undone; therefore, the records should be kept. The volume mount statistics are updated in a top-level transaction embedded in the bitfile server's data-reading transaction.
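The two-transaction disk-write pattern described earlier in this section can be sketched as follows; the function names and types are hypothetical, and the error paths are compressed.

    /* Sketch of the two-transaction disk write (names illustrative). */
    #include <stddef.h>

    typedef struct bitfile bitfile_t;
    typedef long seg_id_t;

    int bfs_write_disk_segment(bitfile_t *bf, void *buf, size_t len)
    {
        seg_id_t seg;
        int error = 0;

        /* Transaction 1: allocate the segment, log its identity, and
         * commit quickly so the disk storage map is free for others. */
        transaction {
            error = ss_AllocSegment(&seg);       /* locks storage map  */
            if (!error)
                error = bfs_WriteAllocLog(seg);  /* recovery log entry */
            if (error)
                abort();
        }
        if (error)
            return (error);

        /* Transaction 2: write the data, link the segment into the
         * file metadata, record the written length, delete the log. */
        transaction {
            error = ss_WriteSegment(seg, buf, len);
            if (!error)
                error = bfs_LinkSegment(bf, seg);
            if (!error)
                error = bfs_DeleteAllocLog(seg);
            if (error)
                abort();   /* segment from transaction 1 survives; it
                            * is deleted in a separate transaction */
        }
        return (error);
    }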
Dealing with transaction side-effects

In a distributed OLTP system, transactions take a certain amount of time to become completely finished after they commit. In SFS, locks on records may be held for a short time until the server finishes the transaction. During this time, searching a system metadata file with a nonunique secondary key can yield unexpected results.

In the HPSS storage server, we index storage maps by a primary key, which is unique for each map, and a secondary key, which may be shared by many maps. The state of the map (free, busy, full, and such) is part of the secondary key. When the server creates a storage segment, it searches the map metadata file by secondary key, looking for free maps.

While testing the segment-creation logic in the tape storage server, we discovered that a rapid serial sequence of storage-segment creation requests was resolved on a series of volumes, not on a single volume as expected. If only one volume was available in the system, the sequence failed with a "no space" error when, in fact, space was available.

The key to understanding this effect is to recognize that distributed transactions take time to resolve. When the storage server changed a storage map's state from busy to free in one transaction, then searched the metadata file for free maps in a new transaction very shortly thereafter, it failed to find the map modified in the first transaction. It then either allocated space on a different volume or, if no other volume was available, returned the "no space" error.

This situation is resolved in HPSS in two ways. First, the problem resolves itself if requests to create tape storage segments are separated by enough time for the transactions to finish in SFS. Second, if a storage server client wishes to create a string of segments on the same volume, an option is provided in the segment-creation function that causes the storage server to read the desired storage map directly, by primary key, rather than indirectly searching for a free volume by secondary key. The direct-read operation blocks until any transactional activity on the desired record is resolved. The delay is minimal, and the system maintains transactional integrity in its metadata changes.

Conclusions

Commercial OLTP technology can provide the ACID properties required by new distributed mass storage systems in managing distributed metadata information. HPSS is an example of a new storage system designed to take advantage of an existing OLTP product. In doing so, development and maintenance costs have been greatly reduced while overall system reliability has been enhanced.

We have shown examples of the use of OLTP systems in the implementation of a high-performance mass storage system, and shown how those algorithms can be modified to meet the requirements of both the mass storage system and the OLTP system.

Distributed mass storage systems present unique issues and challenges in the application of OLTP technology. However, given sufficient capability from the OLTP product, these issues and challenges can be successfully managed and overcome.

This work was performed, in part, by the Lawrence Livermore National Laboratory under the auspices of the US Department of Energy under contract No. W-7405-Eng-48; by Los Alamos National Laboratory, Oak Ridge National Laboratory, and Sandia National Laboratories under the auspices of US Department of Energy Cooperative Research and Development Agreements; by Cornell, Lewis Research Center, and Langley Research Center under the auspices of the National Aeronautics and Space Administration; and by IBM Government Systems under Independent Research and Development and other internal funding.

References

[1] S. Dietzen and A. Spector, "Distributed Transaction Systems," Distributed Computing Environments, McGraw-Hill, pp. 223-257, 1993.

[2] D. Cerutti, "The Rise of Distributed Computing," Distributed Computing Environments, McGraw-Hill, pp. 3-8, 1993.

[3] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, 1993.

[4] R. Orfali, D. Harkey, and J. Edwards, The Essential Client/Server Survival Guide, Van Nostrand Reinhold, 1994, p. 242.

[5] R.A. Coyne, H. Hulen, and R.W. Watson, "The High Performance Storage System," Proc. Supercomputing '93, Portland, OR, IEEE Computer Society Press, Nov. 1993.

[6] Transarc Corporation, Encina Product Overview and Encina Product Documentation, 1992.

[7] IEEE Storage Systems Standards Working Group (SSSWG) (Project 1244), "Reference Model for Open Storage Systems Interconnection, Mass Storage Reference Model Version 5," Sept. 1994. Available from the IEEE SSSWG Technical Editor, Richard Garrison, Martin Marietta, (215) 532-6746.

[8] IBM Corporation, Encina Transactional-C Programmer's Guide and Reference for AIX, SC23-2465-02, 1994.