Technical Report DHPC-062
January 1999
Distributed and Hierarchical Storage Systems
Craig J. Patten, K. A. Hawick, J. F. Hercus and A. L. Brown
Advanced Computational Systems Cooperative Research Centre
Department of Computer Science, University of Adelaide
SA 5005, Australia
Email: {cjp,khawick,james,fred}@cs.adelaide.edu.au
Abstract
We discuss issues and technologies for implementing and applying distributed, high-performance storage systems. We
review a range of software systems for distributed storage management, and summarise what we believe to be the most
important issues and the outstanding problems in distributed storage research. We outline our vision for integrated
storage management that is compatible with our DISCWorld wide-area service-based metacomputing environment.
Very large on-line archives present both a challenge and the driving force for our development of a distributed storage system. In
addition, robotic tape silos form the lowest level of many of these archives. Their integration into the storage hierarchy
is often non-trivial, proprietary and expensive. We describe how our system can provide the necessary infrastructure
for such applications.
Keywords: distribution; NFS; storage; tape silo; metacomputing.
1 Introduction
Various distributed computing technologies have enabled greater integration of high-performance computing, storage
and visualisation resources distributed across wide-area networks (WANs). These integrated systems are often termed
metacomputing systems [5]. In the storage realm, these metacomputing systems present many issues not addressed
through existing distributed storage systems or remote data access mechanisms. Distributed file systems predominantly address “everyday” file access over a relatively local area, and most mechanisms for remote data access in
metacomputing systems focus too closely on the simple client-server case. Wide-area storage management, hierarchical storage access, network latency and complexity, and support for legacy/commercial and custom applications are
examples of issues not suitably covered by these systems.
Within the DISCWorld [8] metacomputing infrastructure, we are developing software technology to present storage as
a service within the system - the DISCWorld Storage Service (DSS) [14]. However, access will not be limited to other
components in the metacomputing system; legacy or commercial applications not specifically designed for use within a
metacomputing system will also be able to utilise the service. The DSS is designed to provide a metacomputing storage
service which is scalable, reliable and portable without the need for operating system or application modifications, and
which is also latency-tolerant and flexible and adaptive in its use of storage and network resources.
In Section 2 we characterise our target applications. In Section 3 we outline the relevant technical issues for distributed storage, and in Section 4 we discuss how hierarchical storage systems add to latency problems. In Section 5 we describe our efforts in this area, including the status of our implemented DWorFS system and our partial implementation of a full storage service. We summarise and detail future plans in Section 6.
2 Target Application Characteristics
Storage in the form of a filesystem is a general requirement of many computer applications, but the access patterns
for distributed storage applications are somewhat more selective. We are targeting those applications and user communities that truly need distributed access to large amounts of online storage. A particular example is users and
applications for geospatial data analysis.
Geospatial data applications often need to access several different sources of data, which may include digital terrain
maps, satellite and air-reconnaissance imagery, other vector or cadastral data such as road and utility networks, or
other measured properties such as population statistics or agricultural properties of a region. Often some of this data
is available to the user and will be owned by that user. However, the bulk archives of satellite imagery or government
survey data are often not owned by a given user or their organisation and are very often managed as a remote archive. Until
recently it was common for data sets to be bought and shipped on tape or CD-ROM to users and accessed locally. The
availability of high performance networks and Internet connectivity between commercial organisations means that it
is now more common to want to access remote data without the inconvenience and delay of shipping a tape copy and
unloading it locally.
We believe a storage system can combine the best features of distribution and a hierarchy of different storage media
to provide a good compromise of cost effectiveness and performance for bulk storage systems distributed nationally
and even internationally. Suitable interoperating software to manage distributed bulk storage facilities will allow
data custodians to manage their own data collections more effectively and provide their users better access to the
collections. We discuss this issue in more depth in [7].
We are developing a distributed storage experiment for the Australian Bureau of Meteorology which involves
connecting sites in Adelaide, Melbourne and Canberra using broadband networking technology to investigate efficient
ways to stage delivery of data from a central site to regional nodes. Individual sites may have their own local data stored
on a hierarchical storage device such as a combined tape silo and disk array. It is therefore advantageous to
have the storage management software at each site interoperate to make best use of the full capacity of the distributed
system. This project involves geospatial data that is stored as hyperbricks containing several timeslices and different
variables. The seek time for a particular temporal-spatial slice, as discussed in Section 4, is therefore non-trivial for a
given application. The software issues in interconnecting the different sites are also of prime consideration; we
discuss these in Section 3 below.
In general we are targeting applications that may be run semi-interactively and which may need to involve both new
components and legacy applications for which the source code is either unavailable or not feasible to modify.
A distributed software infrastructure that uses a conventional file system interface is therefore highly
advantageous. This is what is provided by our DWorFS system. Adding new functionality and modules to allow
control of prefetching and other performance enhancements is what we are targeting with our general DSS, as described
in Section 5.
3 Issues and Technologies for Distribution
Network latency is an important issue in wide-area distributed computing; the speed of light places a fundamental limit
on latency reduction. Round-trip times between geographically separated sites of the order of hundreds of milliseconds
and upwards are not uncommon. It is therefore important in the design of any distributed system to amortize the cost
of this latency into bulk-data transfers. Some work to improve existing network/distributed file system protocols is
being initiated, for example Network File System (NFS) [17] Version 4 [10] and WebNFS Multi-Component Lookup
(MCL) [18]. The VAFS project [2] is augmenting the Andrew File System (AFS) [15] to allow bulk-data transfers
directly over the Asynchronous Transfer Mode (ATM) Adaptation Layer 5 (AAL5). The WebOS project [19] has
implemented a loadable kernel module for Solaris which enables access to HTTP [3] servers through the file system
namespace for their wide-area storage infrastructure. However, whilst each of these systems has its merits, there
are still many issues associated with the wide-area metacomputing storage problem which they do not address.
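To make the amortisation argument above concrete, the short calculation below estimates how much of the link bandwidth a single request/response exchange actually delivers as the transfer size grows. The 300 ms round trip and 10 MB/s link figures are illustrative assumptions, not measurements of any of the systems mentioned.

/* Effective throughput of a single request/response exchange over a
   WAN link: one round trip of control latency plus the transfer time.
   The 300 ms round trip and 10 MB/s link are illustrative assumptions,
   not measurements of the systems discussed in the text. */
#include <stdio.h>

int main(void)
{
    const double rtt = 0.300;          /* assumed round-trip time (s)      */
    const double bandwidth = 10.0e6;   /* assumed link bandwidth (bytes/s) */
    const double sizes[] = { 8.0e3, 64.0e3, 1.0e6, 64.0e6 };

    for (int i = 0; i < 4; i++) {
        double t = rtt + sizes[i] / bandwidth;  /* latency + transfer time */
        double efficiency = (sizes[i] / t) / bandwidth;
        printf("%10.0f bytes: %6.2f s, %5.1f%% of link bandwidth\n",
               sizes[i], t, 100.0 * efficiency);
    }
    return 0;
}

With these assumed figures an 8 KB exchange achieves well under one percent of the link bandwidth, whereas a 64 MB bulk transfer achieves over 95 percent, which is the motivation for amortising latency into bulk-data transfers.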
Network complexity is also of increasing importance in today’s high-performance computing environments. Sites
often have access to a variety of networks, from the “standard Internet”, to high-performance WANs. Some of the
access to these networks is permanent; some is part-time or on-demand for specific experiments. Storage is not
immune to this variety either: the range of media in the storage hierarchy and the many different ways of organising
data on these media place demands on the flexibility of a storage system if it is to perform well. There are many
parameters associated with local and remote storage resources, for example media and network latency and bandwidth
and optimal layout strategies for differing datasets and applications. Distributed systems must therefore be flexible
in their use of such resources and dynamically reconfigurable to handle changes in the resources available, or the
application thereof. Most existing distributed file system technology is lacking in this area and existing metacomputing
remote data access mechanisms do not address the issue.
As wide-area networking technology enables greater integration of geographically distributed resources, the collective
level of heterogeneity any distributed system must handle increases. Portability then becomes even more important
for any system wishing to function effectively over wide-area computing resources. Therefore it is important for a
storage service operating over such resources to provide access to applications without requiring modifications to the
operating system or applications themselves.
Metacomputing systems also present issues in electronic commerce, such as derived product ownership and storage
leasing. If a dataset is accessed through a storage system which provides some level of processing or value-adding, the
access of cached results could be tracked for billing purposes. The temporary leasing of storage resources must also
be handled in a similar fashion. Existing distributed file systems and metacomputing remote data access mechanisms
do not address these issues.
Some existing remote data access mechanisms in metacomputing systems, for example Remote I/O [6], support parallel I/O interfaces; however, these systems do not address the broader issue of distributed storage management across
the wide area. Some distributed file systems are more sophisticated in their use of distributed storage resources, but
do not support parallel I/O. Support for the complex data movements that can occur across a metacomputing system
must be tied into the storage service of such a system for maximum effectiveness.
4 Hierarchical Storage and Tape Silos
Robotic tape silos form the lowest level of many large scale on-line and near-line data storage and backup systems.
They are used to provide economically viable solutions for the storage of large data sets which cannot feasibly
be stored on disk. However, their integration into the storage hierarchy is anything but ubiquitous or standardised,
and is invariably proprietary, expensive (e.g. SAM-FS [12]), application-specific and data-specific; it also often
requires manual intervention. Integrating tape silo storage into a distributed data store involves addressing the inherent
limitations of this type of storage. These limitations are of two types.
The first is the limitation on reliability and convenience in writing to tapes rather than disks. Usually, it is only possible
to write to the end of a tape; writes elsewhere destroy all subsequent data. Unless carefully managed, this property
will lead to fragmentation of the data written to the tapes, a problem rendered more serious than for disk
systems by the high latency characteristics of tape silos.
Latency, or performance overhead, is the second and more serious limitation of tape silo storage. It can be
divided into three types. Firstly, there is the latency of loading tapes and preparing them for access (tape tensioning,
mounting). For the StorageTek TimberWolf 9740 silo using Redwood SD-3 drives [16], this time is around 40 seconds,
provided there is an available tape drive and the robot arm is free. The delay is considerably longer if there is no
available tape drive. This latency can be expected to improve in the future; however, being a large scale (in computing
terms) mechanical procedure, it will always be enormous relative to computer speeds. The remaining two types of latency are
due to the intrinsic nature of tapes. The second is fast-forward time: the time it takes to seek through the tape
to the beginning of the file containing the required data. A fair estimate of the latency imposed by this step is the time
to seek half way through a tape from the beginning; for a 50 GB tape in an SD-3 drive this figure is 53 seconds. The
third type of latency is the time required to seek within the file for the required data. This time is determined by the
read rate of the tape drive (11 MB/s for the SD-3) and the file offset. With careful organisation of data on the tapes this
latency should be minimal.
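The quoted figures can be combined into a rough additive model of the time to reach data on a tape that is not already mounted; the sketch below does so. Treating the three latencies as simply additive, and the particular offsets chosen, are assumptions made for illustration.

/* Rough additive model of the tape-silo access latencies described
   above, using the quoted TimberWolf 9740 / SD-3 figures. Treating
   the three terms as simply additive is an approximation made here
   for illustration only. */
#include <stdio.h>

#define MOUNT_S      40.0    /* load, tension and mount a tape (s)     */
#define HALF_SEEK_S  53.0    /* seek half way through a 50 GB tape (s) */
#define READ_RATE    11.0e6  /* SD-3 read rate (bytes/s)               */

/* Estimated time until the byte at `offset_bytes` within the target
   file is available, assuming the tape is not already mounted. */
static double access_latency(double offset_bytes)
{
    return MOUNT_S + HALF_SEEK_S + offset_bytes / READ_RATE;
}

int main(void)
{
    printf("start of file : %6.1f s\n", access_latency(0.0));
    printf("1 GB into file: %6.1f s\n", access_latency(1.0e9));
    return 0;
}

Under this model even data at the very start of a file on an unmounted tape is around 93 seconds away, and a further gigabyte of offset adds roughly another 90 seconds.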
These stated latencies can be expected to improve in the future, but are likely to remain the primary issue to be
addressed when integrating tape silo storage into a larger distributed data store. There are a few strategies which can
be employed to address the problems posed by these high latencies. The first, and easiest, is to use caching to reduce
the frequency of accesses to the silo. The simple way to add caching to a tape silo is to use a disk or RAID attached
to the same host as the silo. In a distributed system the caching can be expanded to use multiple machines and also
by caching data closer to where it is used. Other strategies for reducing the effect of latency involve management
of the data stored on the silo. This assumes that the data has some high level structure or organisation and that the
access patterns to the data are structured. The problem is to map the data set onto the silo in a manner which increases
locality of access, thereby maximising cache effectiveness. The potential benefits of carefully arranging data on the
silo are considerable. For example, using the silo and tapes mentioned above, the average time to seek between two
data items on separate tapes is about 150 seconds; for items on the same tape it is around 20 seconds, which can be
further reduced by optimising the layout of data on the tape.
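The benefit of improved locality can be quantified with the figures just given: if some fraction of consecutive accesses stay on the same tape, the expected inter-item seek time is a weighted average of the 20 second and 150 second cases. The sketch below tabulates this; the simple weighted-average model is an illustrative assumption.

/* Expected seek time between consecutively accessed items as a
   function of the fraction that stay on the same tape, using the 20 s
   (same tape) and 150 s (separate tapes) averages quoted above. The
   weighted-average model is an illustrative assumption. */
#include <stdio.h>

int main(void)
{
    const double same_tape = 20.0;    /* s, items on the same tape  */
    const double diff_tape = 150.0;   /* s, items on separate tapes */

    for (double p = 0.0; p <= 1.0001; p += 0.25) {
        double expected = p * same_tape + (1.0 - p) * diff_tape;
        printf("same-tape fraction %.2f: expected seek %6.1f s\n",
               p, expected);
    }
    return 0;
}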
The latency problem may be addressed by preempting data requests. This allows some or all of the latency costs
to be incurred before the request is actually made. In a wider system this strategy can be extended to prefetching
data to parts of the system where it will be, or might be, needed. There are two ways to acquire the knowledge about
future requests which is required to preempt operations: automatic and manual preempting. Automatic preempting
involves predicting future requests by extrapolating from current activity and analysing historical access patterns. The
performance of this approach will be determined by the type of application (e.g. how structured its data access pattern
is), the level of sophistication of the prediction algorithm, and the amount of cache space available. Manual preempting
means asking the user of the system or application to specify, in advance, the data required for the application they are
using. This option is capable of the best performance but has the weakness of relying on user input which may not be
available.
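As an illustration of automatic preempting in its simplest form, the sketch below watches for a purely sequential run of requests and hands the predicted next item to a stand-in cache-prefetch routine. The item numbering, the window of three requests and the prefetch hook are assumptions made for this example; a practical predictor would use richer history and some measure of confidence, and would weigh prefetch cost against available cache space.

/* A minimal sketch of automatic preempting: if the last few requests
   have been for consecutive items, speculatively fetch the next item
   into the disk cache before it is asked for. The item numbering, the
   window of three and the prefetch hook are illustrative assumptions. */
#include <stdio.h>

#define WINDOW 3

static long history[WINDOW];
static int  filled;

/* Stand-in for handing the predicted item to the silo/cache layer. */
static void prefetch_into_cache(long item)
{
    printf("prefetching item %ld\n", item);
}

static void note_request(long item)
{
    for (int i = 0; i < WINDOW - 1; i++)   /* slide the request window */
        history[i] = history[i + 1];
    history[WINDOW - 1] = item;
    if (filled < WINDOW)
        filled++;
    if (filled < WINDOW)
        return;

    int sequential = 1;                    /* every step exactly +1?  */
    for (int i = 1; i < WINDOW; i++)
        if (history[i] != history[i - 1] + 1)
            sequential = 0;
    if (sequential)
        prefetch_into_cache(item + 1);
}

int main(void)
{
    long trace[] = { 7, 8, 9, 10, 42, 43 }; /* two short access runs */
    for (int i = 0; i < 6; i++)
        note_request(trace[i]);
    return 0;
}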
5 Our Vision for Distributed Storage
Our initial exploration into metacomputing storage mechanisms produced the DISCWorld File System (DWorFS) [13],
an extensible daemon which provided an NFS interface to arbitrary underlying storage mechanisms. A DWorFS daemon accepts NFS requests as per a normal NFS server; however, underlying dynamically-loaded modules provide the
storage functionality. This allows users to access data from “virtual files” which may not, and may never, fully exist
on any storage medium - the modules simply provide the NFS clients with dynamically-generated directory structure
and file data in response to requests. This enables on-demand production and processing of data products in a manner
which is transparent and portable. Modules built to interface to DWorFS thus far provide access to a GMS-5 [11, 13]
satellite imagery repository and a persistent object store [4]. We are currently planning modules for handling other
large data archives such as meteorological model data hyperbricks.
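This report does not reproduce the DWorFS module API, but one plausible shape for such dynamically-loaded modules is a table of callbacks that the NFS-facing daemon resolves at run time with dlopen() and dlsym(), as sketched below. The structure layout, the dworfs_module symbol name and the callback signatures are illustrative assumptions rather than the actual interface.

/* One plausible shape for a dynamically loaded DWorFS-style dataset
   module: a table of callbacks that the NFS-facing daemon resolves at
   run time with dlopen()/dlsym(). The structure, the "dworfs_module"
   symbol name and the signatures are illustrative assumptions; the
   report does not specify the real API. */
#include <dlfcn.h>
#include <stdio.h>
#include <sys/types.h>

struct dataset_module {
    const char *prefix;                               /* e.g. "/gms5" */
    int     (*lookup)(const char *path, int *is_dir); /* name exists? */
    ssize_t (*read)(const char *path, void *buf,      /* produce data */
                    size_t len, off_t off);
};

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s module.so\n", argv[0]);
        return 1;
    }
    void *handle = dlopen(argv[1], RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Each module is assumed to export one well-known symbol. */
    struct dataset_module *mod = dlsym(handle, "dworfs_module");
    if (mod == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }
    printf("loaded dataset module serving %s\n", mod->prefix);
    dlclose(handle);
    return 0;
}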
Figure 1: Overview of the DISCWorld Storage Service architecture, illustrating the modular design and communications infrastructure. The figure shows clients accessing the DSS on a DISCWorld node via NFS through the DWorFS layer, dataset modules (GOES-9, ...), the Storage Manager, and communication with other DSS peers.
The DSS, illustrated in Figure 1, builds on our DWorFS work by taking the concept of providing data in potentia
through a user-space NFS server to construct a decentralised metacomputing storage service. Each node offering
storage services in the DISCWorld Storage System must run a DSS daemon which, as for DWorFS, provides an NFS
interface.
For scalability and reliability, we have designed the DSS to be completely decentralised. A client wishing to access
the DSS infrastructure must simply use NFS to access its nearest DSS node, which will then communicate with its
peers to arrange all requested data transfers. Clients running a Unix-variant operating system can themselves run a
DSS server, accessing it across their loopback network interface, to remove the need to rely on another host for access
to the “cloud” of DISCWorld storage resources.
The DSS, unlike general distributed file systems, has some level of understanding of the data products it provides.
As in DWorFS, modules respond to client requests for directory structures and file data. For example, all requests
for pathnames within the gms5 directory may be directed to the GMS-5 dataset module for processing. The DSS
provides an underlying storage and distribution API for the modules, through the DSS Storage Manager. As with
the DWorFS layer in the DSS, the Storage Manager also utilises dynamically-loaded modules, in this case to provide
access to some (possibly remote) storage mechanism: for example, an actual filesystem on disk,
a tape silo, or even a remote HTTP server. These modules must provide at least a core minimum of services to the
Storage Manager, but can provide their own extensions if so desired. This allows the Storage Manager to provide
dataset modules with uniform access to a variety of storage mechanisms, with the capability for providing custom
features. For example, a satellite image dataset module may make use of extended tiling and data layout features of a
specific raw disk storage module.
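A hypothetical rendering of this core-plus-extensions arrangement is sketched below: a storage module exports a small set of mandatory operations together with a query function through which a dataset module can probe for extra features such as tiling. None of the names here are taken from the actual DSS API; they only illustrate the design described above.

/* A hypothetical rendering of the "core minimum plus optional
   extensions" arrangement for Storage Manager storage modules.
   The operation names, the extension query and the tiling extension
   are assumptions made for illustration, not the published DSS API. */
#include <stddef.h>
#include <sys/types.h>

/* Core operations every storage module must supply. */
struct storage_module {
    int     (*open)(const char *object);
    ssize_t (*get)(const char *object, void *buf, size_t len, off_t off);
    ssize_t (*put)(const char *object, const void *buf, size_t len, off_t off);
    int     (*close)(const char *object);

    /* Optional: return a named extension interface, or NULL if the
       module (say, a plain on-disk filesystem) does not provide it. */
    void *  (*extension)(const char *name);
};

/* A hypothetical extension a raw-disk module might offer and a
   satellite-image dataset module might exploit. */
struct tiling_extension {
    int (*set_tile_geometry)(const char *object, int tile_w, int tile_h);
};

/* How a dataset module might probe for the optional extension. */
int request_tiling(struct storage_module *sm, const char *object)
{
    struct tiling_extension *te = NULL;
    if (sm->extension != NULL)
        te = sm->extension("tiling");
    if (te != NULL)
        return te->set_tile_geometry(object, 512, 512);
    return 0;   /* fall back to plain get()/put() access */
}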
Through these storage modules, DSS supports the integration of arbitrary levels of the storage hierarchy, for example
robotic tape silos. Typically, the software technology used to access these silos is not ubiquitous or standard, and
is proprietary and expensive. We have implemented a simple storage module to access a StorageTek TimberWolf
9740 [16] tape silo; however, arbitrary functionality, such as caching and prefetching, could be implemented through
the storage module API.
Distribution in the DSS is handled through a storage module providing a communications protocol, the InterDSS
protocol, for use between DSS nodes. Specifically designed for bulk-data transfer, this protocol provides a mechanism
for efficient data transfers between DSS nodes, whilst also providing a higher-level API for flexible access to storage
resources on remote DSS nodes.
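The InterDSS wire format is not specified in this report. Purely as an illustration of how a bulk-transfer request between DSS nodes might be framed, the sketch below packs a fixed-size header, to be followed by the object name and (for writes) the bulk data, into network byte order. All field names and sizes are assumptions.

/* The InterDSS wire format is not given in this report; this is only a
   sketch of how a bulk-transfer request between DSS nodes might be
   framed. All field names and sizes are assumptions. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl(), for network byte order */

enum interdss_op { IDSS_GET = 1, IDSS_PUT = 2, IDSS_STAT = 3 };

struct interdss_hdr {
    uint32_t version;    /* protocol version                       */
    uint32_t op;         /* one of enum interdss_op                */
    uint64_t offset;     /* byte offset within the named object    */
    uint64_t length;     /* bytes requested or supplied            */
    uint32_t name_len;   /* length of the object name that follows */
};

/* Serialise the fixed fields into network byte order; the object name
   and, for IDSS_PUT, the bulk data follow immediately afterwards. */
static size_t hdr_pack(const struct interdss_hdr *h, unsigned char *out)
{
    uint32_t v;
    size_t n = 0;
    v = htonl(h->version);                  memcpy(out + n, &v, 4); n += 4;
    v = htonl(h->op);                       memcpy(out + n, &v, 4); n += 4;
    v = htonl((uint32_t)(h->offset >> 32)); memcpy(out + n, &v, 4); n += 4;
    v = htonl((uint32_t)h->offset);         memcpy(out + n, &v, 4); n += 4;
    v = htonl((uint32_t)(h->length >> 32)); memcpy(out + n, &v, 4); n += 4;
    v = htonl((uint32_t)h->length);         memcpy(out + n, &v, 4); n += 4;
    v = htonl(h->name_len);                 memcpy(out + n, &v, 4); n += 4;
    return n;                               /* 28 bytes in total */
}

int main(void)
{
    unsigned char buf[64];
    struct interdss_hdr h = { 1, IDSS_GET, 0, 1 << 20, 5 };
    size_t n = hdr_pack(&h, buf);
    /* buf would now be written to the peer, followed by the 5-byte
       object name; the peer replies with a similar header and data. */
    return n == 28 ? 0 : 1;
}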
We are still in the implementation phase of the DISCWorld Storage Service. The DWorFS layer has been implemented,
as have some simple dataset and storage modules. Some initial performance measurements have been produced [4]
for the DWorFS persistent object store module which are unfortunately inconclusive relative to the Solaris 2.5 NFS
server. The performance of the DWorFS implementation suffers due to the blocksize and synchronous-write limitations of
NFSv2, but in some write-heavy operations it outperforms the Solaris server when the module minimises stable store
write activity. We are presently reviewing the option of supporting full NFSv3 capabilities.
6 Summary and Future Work
Existing approaches to wide-area distributed and hierarchical storage in a metacomputing system do not sufficiently
address the unique issues and problems that arise in such an environment: for example, network latencies and complexities, storage hierarchy integration, and the need for ubiquitous access mechanisms.
We have outlined the DISCWorld Storage Service, a prototypical metacomputing storage service designed to address
some of the open research issues. These issues include scalability, reliability, portability, latency tolerance, and
flexibility and adaptivity in the use of storage and network resources. In our planned system, the use of hierarchical storage
media with a spectrum of additional latency properties presents similar problems to those inherent in distribution over
wide areas. The latency overheads can in fact be modeled as another layer in the hierarchy of access latencies. We
have completed the DWorFS NFS interface component of our DSS vision, although it remains to decide whether the
use of NFSv3 capability is required. Initial performance results from applying DWorFS to persistent object stores are
promising.
Immediate future work includes completing the InterDSS protocol and Storage Manager development. After this, we
plan to design and implement several more dataset and storage modules and to experiment with the various system
design parameters to gain a fuller understanding of the issues involved. Future plans also include enabling the DWorFS
layer to be extensible in the interfaces it provides, and to use this functionality to integrate the Storage Service with
DISCWorld itself.
Acknowledgments
This work was carried out under the Distributed High Performance Computing Infrastructure Project (DHPC-I) of the
On-Line Data Archives Program (OLDA) of the Advanced Computational Systems (ACSys) [1] Cooperative Research
Centre (CRC) and funded by the Research Data Networks (RDN) CRC. ACSys and RDN are funded by the Australian
Commonwealth Government CRC Program.
References
[1] Advanced Computational Systems Cooperative Research Centre, http://acsys.adelaide.edu.au.
[2] “AFS for Very High Speed Networks”, Charles J. Antonelli, Center for Information Technology Integration,
University of Michigan, October 1998, http://www.citi.umich.edu/projects/vafs.
[3] “Hypertext Transfer Protocol – HTTP/1.0”, T. Berners-Lee, R. Fielding and H. Frystyk, RFC 1945, May 1996.
[4] “Utilising NFS to Expose Persistent Object Store I/O”, A. L. Brown, Submitted to 6th IDEA Workshop, Rutherglen, January 1998.
[5] “Metacomputing”, C. Catlett and L. Smarr, CACM, 35 (1992), pp. 44-52.
[6] “Remote I/O: Fast Access to Distant Storage”, I. Foster, D. Kohr, R. Krishnaiyer, J. Mogill, Proc. Workshop on
I/O in Parallel and Distributed Systems, pp. 14-25, 1997.
[7] “Interfacing to Distributed Active Data Archives”, K.A.Hawick and P.D.Coddington, November 1998, To appear
in International Journal on Future Generation Computer Systems.
[8] “DISCWorld: An Environment for Service-Based Metacomputing”, K. A. Hawick, P. D. Coddington,
D. A. Grove, J. F. Hercus, H. A. James, K. E. Kerry, J. A. Mathew, C. J. Patten, A. J. Silis and F. A. Vaughan,
Future Generation Computer Systems, Special Issue on Metacomputing, and Technical Report DHPC-042,
April 1998.
[9] “Managing Distributed, High-Performance Storage Technology”, K. A. Hawick and Craig J. Patten, Technical
Report DHPC-054, September 1998.
[10] “Network File System Version 4 (NFSv4) Working Group Charter”, Internet Engineering Task Force (IETF),
http://www.ietf.org/html.charters/nfsv4-charter.html.
[11] “GMS User’s Guide”, Japanese Meteorological Satellite Center, 2nd Ed., 1989.
[12] “SAM-FS”, LSC Inc., http://www.lsci.com/lsci/products/samfs.htm.
[13] “DWorFS: File System Support for Legacy Applications in DISCWorld”, Craig J. Patten, F. A. Vaughan,
K. A. Hawick and A. L. Brown, Proc. 5th Integrated Data Environments Workshop, Fremantle, February 1998.
[14] “Towards a Scalable Metacomputing Storage Service”, Craig J. Patten, K. A. Hawick and J. F. Hercus, Technical
Report DHPC-058, November 1998.
[15] “The ITC Distributed File System: Principles and Design”, M. Satyanarayanan, J. H. Howard, D. A. Nichols,
R. N. Sidebotham, A. Z. Spector, and M. J. West, Proc. Tenth Symposium on Operating Systems Principles, pp.
35-50, 1985.
[16] “StorageTek Hardware Products”, Storage Technology Corp.,
http://www.storagetek.com/StorageTek/hardware.
[17] “Network File System Version 3 (NFSv3) Specification”, Sun Microsystems, RFC 1813, June 1995.
[18] “WebNFS Server Specification”, Sun Microsystems, RFC 2055, October 1996.
[19] “WebOS: Operating System Services for Wide Area Applications”, Amin Vahdat, Thomas Anderson, Michael
Dahlin, Eshwar Belani, David Culler, Paul Eastham, and Chad Yoshikawa, Proc. Seventh High Performance
Distributed Computing Conference, Chicago, July 1998.